Similarity Search

Learn how to find similar nodes in the Knowledge Graph.

Page Contents

Overview
Under the Hood

Often we need to find similar items in an efficient and scalable way. This is a general problem: you’re searching in a graph of nodes that represent real-world objects, and the main thing you want to consider is similarity between pairs of objects. The reasons for doing this vary: you might be building a recommendation system, tracing data lineage, or debugging a business process where a problem in one object may also occur in similar objects.

Overview

To get into the details without getting bogged down, let’s explore a specific example, using the movie dataset.

Similarity search follows the same syntax and pipeline as our other machine learning models. First, you need to create a model, which holds the set of items available for search. The spa:arguments property receives the features used for similarity calculation, while spa:predict contains the identifier of the item.

prefix : <http://schema.org/>
prefix spa: <tag:stardog:api:analytics:>

INSERT {
    graph spa:model {
        :simModel a spa:SimilarityModel ;
                  spa:arguments (?genres ?directors ?writers ?producers ?metaCritic) ;
                  spa:predict ?movie .
    }
}
WHERE {
    SELECT
    (spa:set(?genre) as ?genres)
    (spa:set(?director) as ?directors)
    (spa:set(?writer) as ?writers)
    (spa:set(?producer) as ?producers)
    ?metaCritic
    ?movie
    {
        ?movie  :genre ?genre ;
                :director ?director ;
                :author ?writer ;
                :productionCompany ?producer ;
                :metaCritic ?metaCritic .
    }
    GROUP BY ?movie ?metaCritic
}

Here, we are creating a SimilarityModel named :simModel which takes as input the genres, directors, writers, producers and MetaCritic score for all movies in the dataset.

Using this model it’s pretty easy to find similar movies. We select a movie and its properties and pass it as input to the model. The number of similar items to return is controlled by the spa:limit property given in spa:parameters.

prefix : <http://schema.org/>
prefix t: <http://www.imdb.com/title/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix spa: <tag:stardog:api:analytics:>

SELECT ?similarMovieLabel ?confidence
WHERE {
    graph spa:model {
      :simModel spa:arguments (?genres ?directors ?writers ?producers ?metaCritic) ;
                spa:confidence ?confidence ;
                spa:parameters [ spa:limit 5 ] ;
                spa:predict ?similarMovie .
    }

    { ?similarMovie rdfs:label ?similarMovieLabel }

    {
        SELECT
        (spa:set(?genre) as ?genres)
        (spa:set(?director) as ?directors)
        (spa:set(?writer) as ?writers)
        (spa:set(?producer) as ?producers)
        ?metaCritic
        ?movie
        {
            ?movie  :genre ?genre ;
                    :director ?director ;
                    :author ?writer ;
                    :productionCompany ?producer ;
                    :metaCritic ?metaCritic .

            VALUES ?movie { t:tt0118715 } # The Big Lebowski
        }
        GROUP BY ?movie ?metaCritic
    }
}

ORDER BY DESC(?confidence)

This query finds the five movies most similar to The Big Lebowski, along with their similarity scores, based on the features given through spa:arguments.

similarMovieLabel	confidence
The Big Lebowski	0.9999999999999998
Fargo	0.9996443676337468
Blood Simple	0.9996332068990889
The Man Who Wasn’t There	0.9996019945613324
Barton Fink	0.9995802728226650

As expected, the most similar item is the movie itself, followed by other movies from the inimitable Coen Brothers.

Just like in our other models, similarity search features can have any datatype: numbers, strings, sets, etc. Stardog automatically chooses a suitable representation for each feature when it calculates a similarity score.

Under the Hood

Items and their features are vectorized using feature hashing, the same technique used by our classification and regression models. These vectors are saved in a search index built with cluster pruning, an approximate search algorithm that groups items by similarity in order to speed up queries.
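To make feature hashing more concrete, here is a minimal Python sketch of the general technique. It is not Stardog's implementation: the function name hash_features, the 64-dimension default, the use of SHA-256 as the hash, and the way sets and numbers are tokenized are all assumptions made for illustration.

```python
import hashlib


def hash_features(features, dim=64):
    """Map a dict of named features to a fixed-length vector (hashing trick).

    Hypothetical helper for illustration only; the tokenization rules below
    are assumptions, not Stardog's actual encoding.
    """
    vec = [0.0] * dim
    for name, value in features.items():
        if isinstance(value, (set, frozenset, list, tuple)):
            # Each member of a set-valued feature becomes its own token.
            tokens = [(f"{name}={member}", 1.0) for member in value]
        elif isinstance(value, (int, float)):
            # A numeric feature keeps its magnitude under a single token.
            tokens = [(name, float(value))]
        else:
            # Anything else (e.g. a string) is treated as one categorical token.
            tokens = [(f"{name}={value}", 1.0)]
        for token, weight in tokens:
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            # The first four bytes pick the bucket; the next byte picks the sign.
            index = int.from_bytes(digest[:4], "big") % dim
            sign = 1.0 if digest[4] % 2 == 0 else -1.0
            vec[index] += sign * weight
    return vec
```

Calling hash_features({"genres": {"Comedy", "Crime"}, "metaCritic": 71}) (illustrative values) always yields a vector of the same length, regardless of how many distinct genres or directors exist in the data. That fixed length is what lets items with very different feature sets be compared in the same vector space.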

The index is used to find the vectors with the largest cosine similarity to the query item; that similarity is the score returned by spa:confidence.
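The following Python sketch shows one textbook form of cluster pruning combined with cosine similarity. It is an illustration under stated assumptions rather than Stardog's actual index: the function names (cosine, build_index, search), the choice of roughly sqrt(N) random leaders, and the decision to search only the single closest cluster are all hypothetical.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_index(vectors, seed=0):
    """Group items around randomly chosen leaders.

    vectors: dict mapping item id -> vector.
    Returns a dict mapping leader id -> list of member ids (leader included).
    """
    rng = random.Random(seed)
    ids = sorted(vectors)
    # Assumption: about sqrt(N) leaders, a common choice for cluster pruning.
    leaders = rng.sample(ids, max(1, math.isqrt(len(ids))))
    clusters = {leader: [] for leader in leaders}
    for item in ids:
        # Attach every item to the leader it is most similar to.
        best = max(leaders, key=lambda l: cosine(vectors[item], vectors[l]))
        clusters[best].append(item)
    return clusters


def search(vectors, clusters, query, limit=5):
    """Return up to `limit` (item id, similarity) pairs from the closest cluster."""
    # Compare the query with the leaders only, then scan that one cluster.
    leader = max(clusters, key=lambda l: cosine(query, vectors[l]))
    scored = [(cosine(query, vectors[i]), i) for i in clusters[leader]]
    scored.sort(reverse=True)
    return [(item, score) for score, item in scored[:limit]]
```

Because only one cluster is scanned, a query compares against far fewer vectors than a full scan would, at the cost of occasionally missing a close match that landed in a neighboring cluster. That trade-off is why the search is approximate.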

The Stardog docs describe advanced parameters which can be used to increase query performance and recall.
