Similarity Search
Learn how to find similar nodes in the Knowledge Graph.
Often we need to find similar items in an efficient and scalable way. This is a general problem: you're searching a graph of nodes that represent real-world objects, and the main thing you want to consider is the similarity between pairs of objects. The reasons for doing this vary: you might be building a recommendation system, tracing data lineage, or debugging a business process where a problem in one object may also occur in similar objects.
Overview
To get into the details without getting bogged down, let’s explore a specific example, using the movie dataset.
Similarity search follows the same syntax and pipeline as our other machine learning models. First, you need to create a model, which holds the set of items available for search. The spa:arguments property receives the features used for similarity calculation, while spa:predict contains the identifier of the item.
prefix : <http://schema.org/>
prefix spa: <tag:stardog:api:analytics:>
INSERT {
    graph spa:model {
        :simModel a spa:SimilarityModel ;
            spa:arguments (?genres ?directors ?writers ?producers ?metaCritic) ;
            spa:predict ?movie .
    }
}
WHERE {
    SELECT
        (spa:set(?genre) as ?genres)
        (spa:set(?director) as ?directors)
        (spa:set(?writer) as ?writers)
        (spa:set(?producer) as ?producers)
        ?metaCritic
        ?movie
    {
        ?movie :genre ?genre ;
            :director ?director ;
            :author ?writer ;
            :productionCompany ?producer ;
            :metaCritic ?metaCritic .
    }
    GROUP BY ?movie ?metaCritic
}
Here, we are creating a SimilarityModel named :simModel, which takes as input the genres, directors, writers, producers, and MetaCritic score of every movie in the dataset.
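To make the shape of the model's input concrete, here is a minimal Python sketch of what the GROUP BY and spa:set aggregation amount to: many rows per movie are collapsed into one record whose multi-valued properties become sets. The rows, the identifiers m1, m2, and d1, and the helper group_features are hypothetical and exist only for illustration; this is not how Stardog evaluates the query.

```python
from collections import defaultdict

# Hypothetical (movie, property, value) rows standing in for the data
# matched by the WHERE clause. None of these values come from the dataset.
rows = [
    ("m1", "genre", "Comedy"),
    ("m1", "genre", "Crime"),
    ("m1", "director", "d1"),
    ("m1", "metaCritic", 69),
    ("m2", "genre", "Crime"),
    ("m2", "director", "d1"),
    ("m2", "metaCritic", 85),
]

# Properties that the query aggregates with spa:set (assumed subset).
SET_VALUED = {"genre", "director"}


def group_features(rows):
    """Collapse per-value rows into one feature record per movie."""
    items = defaultdict(lambda: defaultdict(set))
    for movie, prop, value in rows:
        if prop in SET_VALUED:
            items[movie][prop].add(value)  # analogous to spa:set(?genre)
        else:
            items[movie][prop] = value     # scalar, like ?metaCritic in GROUP BY
    return {movie: dict(features) for movie, features in items.items()}


grouped = group_features(rows)
for movie, features in grouped.items():
    print(movie, features)
```

Each resulting record corresponds to one item in the model: the set-valued and scalar features play the role of spa:arguments, and the movie identifier plays the role of spa:predict.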
Using this model, it's easy to find similar movies. We select a movie and its properties and pass them as input to the model. The number of similar items to return is controlled by the spa:limit property given in spa:parameters.
prefix : <http://schema.org/>
prefix t: <http://www.imdb.com/title/>
prefix spa: <tag:stardog:api:analytics:>
SELECT ?similarMovieLabel ?confidence
WHERE {
    graph spa:model {
        :simModel spa:arguments (?genres ?directors ?writers ?producers ?metaCritic) ;
            spa:confidence ?confidence ;
            spa:parameters [ spa:limit 5 ] ;
            spa:predict ?similarMovie .
    }
    { ?similarMovie rdfs:label ?similarMovieLabel }
    {
        SELECT
            (spa:set(?genre) as ?genres)
            (spa:set(?director) as ?directors)
            (spa:set(?writer) as ?writers)
            (spa:set(?producer) as ?producers)
            ?metaCritic
            ?movie
        {
            ?movie :genre ?genre ;
                :director ?director ;
                :author ?writer ;
                :productionCompany ?producer ;
                :metaCritic ?metaCritic .
            VALUES ?movie { t:tt0118715 } # The Big Lebowski
        }
        GROUP BY ?movie ?metaCritic
    }
}
ORDER BY DESC(?confidence)
This query finds the five movies most similar to The Big Lebowski, along with their similarity scores, based on the features given through spa:arguments.
similarMovieLabel | confidence |
---|---|
The Big Lebowski | 0.9999999999999998 |
Fargo | 0.9996443676337468 |
Blood Simple | 0.9996332068990889 |
The Man Who Wasn’t There | 0.9996019945613324 |
Barton Fink | 0.9995802728226650 |
As expected, the most similar item is the movie itself, followed by other movies from the inimitable Coen Brothers.
Just like other models, similarity search features can have any datatype: numbers, strings, sets, etc. Stardog automatically chooses the best representation for each feature when it calculates a similarity score.
Under the Hood
Items and their features are vectorized using feature hashing, the same technique used by our classification and regression models. These vectors are saved in a search index created using cluster pruning, an approximate search algorithm which groups items based on their similarity in order to speed up query performance.
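As a rough illustration of the feature hashing idea, the following Python sketch maps a record of mixed-type features to a fixed-length vector. It is a generic version of the hashing trick, not Stardog's implementation: the vector size DIM, the use of SHA-256, the signed buckets, and the helper names bucket and hash_features are all assumptions made for this example.

```python
import hashlib

DIM = 32  # hypothetical vector size, chosen small for readability


def bucket(token, dim=DIM):
    """Map a string token to an (index, sign) pair using a stable hash."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % dim
    sign = 1.0 if digest[4] % 2 == 0 else -1.0
    return index, sign


def hash_features(features, dim=DIM):
    """Turn a {name: value} record into a fixed-length list of floats."""
    vec = [0.0] * dim
    for name, value in features.items():
        if isinstance(value, (set, frozenset, list, tuple)):
            # Set-valued feature: one token per member.
            for member in value:
                i, s = bucket(f"{name}={member}", dim)
                vec[i] += s
        elif isinstance(value, (int, float)):
            # Numeric feature: the name picks the bucket, the value is the weight.
            i, s = bucket(name, dim)
            vec[i] += s * float(value)
        else:
            # Anything else is treated as a single categorical token.
            i, s = bucket(f"{name}={value}", dim)
            vec[i] += s
    return vec


# Hypothetical record, not taken from the movie dataset.
example = {"genre": {"Comedy", "Crime"}, "director": {"d1"}, "metaCritic": 69}
vector = hash_features(example)
print(len(vector))
```

The point of the technique is that the vector length stays fixed no matter how many distinct genres, directors, or other values appear in the data, at the cost of occasional collisions between tokens that land in the same bucket.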
The index is used to find the vectors with the largest cosine similarity, which is the score given by spa:confidence.
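The combination of cluster pruning and cosine scoring can be sketched in a few lines of Python. This is the textbook form of the algorithm (random leaders, each item attached to its closest leader, queries compared only against the followers of the nearest leaders), written under assumptions: the function names, the seed and probes parameters, and the toy vectors are hypothetical and do not correspond to Stardog parameters.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_index(vectors, seed=0):
    """Pick about sqrt(N) random leaders and attach every item to its closest leader."""
    rng = random.Random(seed)
    ids = sorted(vectors)
    leaders = rng.sample(ids, max(1, math.isqrt(len(ids))))
    clusters = {leader: [] for leader in leaders}
    for item in ids:
        best = max(leaders, key=lambda leader: cosine(vectors[item], vectors[leader]))
        clusters[best].append(item)
    return clusters


def search(vectors, clusters, query, limit=5, probes=1):
    """Score only the followers of the `probes` leaders closest to the query."""
    nearest = sorted(
        clusters, key=lambda leader: cosine(query, vectors[leader]), reverse=True
    )[:probes]
    candidates = [item for leader in nearest for item in clusters[leader]]
    scored = [(item, cosine(query, vectors[item])) for item in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


# Hypothetical two-dimensional vectors; real hashed vectors are much longer.
vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.1, 0.9]}
clusters = build_index(vectors)
print(search(vectors, clusters, query=[1.0, 0.0], limit=2))
```

Because only some clusters are inspected, the search is approximate: a truly similar item that was attached to a different leader can be missed. Checking more leaders per query (a larger probes value in this sketch) trades speed for recall.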
The Stardog docs describe advanced parameters which can be used to increase query performance and recall.