Entity Resolution

This chapter describes Entity Resolution, a Stardog feature that uses fuzzy matching to identify records referring to the same real-world entity. This feature is in Beta. This page explains what Entity Resolution is, how it works, and which operations it supports.

Page Contents
  1. Overview
  2. How it Works
  3. Configuration
  4. Results
  5. Examples
  6. Query Results
  7. Metadata

Overview

Stardog supports operations that match real-world entities (e.g., a person) using fuzzy matching techniques. This is known as Entity Resolution, Record Linkage, or Data Matching. Entity Resolution identifies data records, in a single data source or across multiple data sources, that refer to the same real-world entity, and links those records together. We recommend using the external compute functionality that the Stardog platform provides for entity resolution. In-memory entity resolution is supported only for smaller graphs, i.e., those with fewer than 100K triples, and is more appropriate for demo purposes.
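To make the idea of fuzzy matching concrete, here is a minimal sketch that scores two name strings and flags them as duplicates when the score exceeds a threshold. It uses Python's standard `difflib.SequenceMatcher` purely as a stand-in: it is not the algorithm Stardog uses, and the helper names `similarity` and `is_duplicate` are hypothetical. The threshold of 0.7 mirrors the default of `stardog.er.similarity.threshold` described under Configuration.

```python
from difflib import SequenceMatcher

# Mirrors the documented default of stardog.er.similarity.threshold.
THRESHOLD = 0.7


def similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] for two strings.

    Illustrative stand-in only; Stardog's actual scoring is not shown here.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_duplicate(a: str, b: str, threshold: float = THRESHOLD) -> bool:
    """Treat two values as duplicates when their score exceeds the threshold."""
    return similarity(a, b) > threshold


if __name__ == "__main__":
    pairs = [
        ("Sebestian Vincent", "Sebestn Vincent"),
        ("Sebestian Vincent", "Alice Smith"),
    ]
    for left, right in pairs:
        print(left, "|", right, "| duplicate:", is_duplicate(left, right))
```

Raising the threshold makes matching stricter (fewer, more certain duplicates), while lowering it makes matching more permissive.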

Entity resolution is invoked through the entity-resolution resolve CLI command.

How it Works

To run Entity Resolution, Stardog requires the database name, a Select query, the name of the IRI field in the Select query results, and the name of the target graph. Stardog executes the Select query on the database to fetch the entities identified by the IRI field, runs entity resolution on them, and writes the results to the target graph. If an entity returned by the Select query has an attribute with multiple values, group the values and convert them into a single comma-separated value. For details, refer to the entity-resolution resolve CLI command.
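The inputs above can be sketched as a small helper that assembles the command line in the same argument order used in the examples on this page (database, Select query, IRI field name, target graph, and optionally `-c` for external compute). The query is a hypothetical example that assumes a `:phone` attribute may have several values and uses the standard SPARQL `group_concat` aggregate to collapse them into one comma-separated value per entity. The helper name `build_resolve_command` is an assumption for illustration and not part of Stardog.

```python
import shlex

# Hypothetical query: one row per person, with all :phone values
# collapsed into a single comma-separated string.
QUERY = (
    'select ?person (group_concat(?phone; separator=",") as ?phones) '
    "{ ?person a :Person ; :phone ?phone } group by ?person"
)


def build_resolve_command(database, query, iri_field, target_graph, compute=None):
    """Assemble the argument list for `stardog entity-resolution resolve`."""
    argv = [
        "stardog", "entity-resolution", "resolve",
        database, query, iri_field, target_graph,
    ]
    if compute is not None:
        # -c names the external compute data source, as in the example on this page.
        argv += ["-c", compute]
    return argv


command = build_resolve_command("myDB", QUERY, "person", "test:myTargetNamedGraph")

if __name__ == "__main__":
    # shlex.join quotes the query so it can be pasted into a shell.
    print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run` on a machine where the Stardog CLI is installed. Note that the IRI field name (`person`) matches the `?person` variable projected by the query.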

Configuration

Entity resolution exposes a few configuration options, which the user can pass as a properties file to the entity-resolution resolve CLI command.

| Name | Description | Default Value |
|------|-------------|---------------|
| stardog.er.similarity.threshold | Two entities with a similarity score of more than this value are considered duplicates. The value should be in the range of 0 to 1 and affects the quality of the results. | 0.7 |
| stardog.er.dataset.partition | Number of partitions to split the dataset into for the ER pipeline. Only valid for entity resolution using external compute, where it affects performance. Refer to spark-docs. | 8 |
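As a minimal sketch, the two options in the table can be rendered into the text of a properties file as follows. The helper name `render_er_properties` is hypothetical. The range check reflects the documented 0 to 1 range for the threshold. How the file is passed to the command is not shown here, so consult the CLI reference for the exact option.

```python
def render_er_properties(similarity_threshold=0.7, dataset_partition=8):
    """Return properties-file text for the documented ER options.

    Defaults match the documented default values.
    """
    if not 0 <= similarity_threshold <= 1:
        raise ValueError("stardog.er.similarity.threshold must be between 0 and 1")
    return (
        f"stardog.er.similarity.threshold={similarity_threshold}\n"
        f"stardog.er.dataset.partition={dataset_partition}\n"
    )


if __name__ == "__main__":
    # Stricter matching than the default, with the default partition count.
    print(render_er_properties(similarity_threshold=0.85), end="")
```

The returned text can be written to a file such as `er.properties` (a file name chosen here only for illustration).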

Results

Entity resolution does not modify the existing data. Instead, it writes its results to the provided target graph, linking the duplicate entities so that users can easily query them. Users can then take appropriate action on the duplicates, e.g., merging them by running an update query as a follow-up step. For each group of detected duplicate entities, entity resolution creates a new node of type tag:stardog:api:EntityMatch and links each entity in the group to this node with an edge of type tag:stardog:api:entityMatch.

If the user includes the score in the output, a new node of type tag:stardog:api:EntityMatchInfo (the score node) is created for each pair of duplicates in the group to represent their similarity score. The node of type tag:stardog:api:EntityMatch, which is unique for the group of duplicates, is linked to each score node by an edge of type tag:stardog:api:hasEntityMatchInfo, and the two duplicate entities are linked to the score node by edges Entity1 (tag:stardog:api:Entity1) and Entity2 (tag:stardog:api:Entity2).

Examples

If we have the following duplicate entities in the graph corresponding to a single person:


# sebestian1
:Sebestian_Vincent a :Person;
:fname "Sebestian";
:lname "Vincent";
:age "24"^^xsd:integer;
:phone "9871456789".

# sebestian2
:Sebestn_Vincent a :Person;
:fname "Sebestn";
:lname "Vincent";
:age "2"^^xsd:integer;
:phone "9871456789".

# sebestian3
:Sebestn_Vint a :Person;
:fname "Sebestian";
:lname "Vint";
:age "24"^^xsd:integer;
:phone "987145678".

Resolve the entities with the default settings. Please refer to the documentation for a complete query example.

    $ stardog entity-resolution resolve myDB "select * {  ?person a :Person ; :fname ?first_name }" "person" "test:myTargetNamedGraph"

You may also use external compute to resolve the entities. Please refer to the documentation for a complete query example.

    $ stardog entity-resolution resolve myDB "select * {  ?person a :Person ; … }" "person" "test:myTargetNamedGraph" -c my_external_compute_datasource_name

Performing entity resolution on such a graph results in an output that, when visualized, looks as shown below:

[Figure: Entity Resolution results]

Performing entity resolution with a score option on such a graph results in an output that, when visualized, looks as shown below:

[Figure: Entity Resolution results with score]

Query Results

The following SPARQL query finds the duplicates of a given entity:


select ?targetEntity ?entity {
    values ?targetEntity { :Sebestian_Vincent }
    ?targetEntity stardog:entityMatch/^stardog:entityMatch ?entity
    filter(?entity != ?targetEntity)
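}

The following query also returns the similarity score for each duplicate of the given entity, provided that scores were included in the output:
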
select ?targetEntity ?entity ?score {
    values ?targetEntity { :Sebestn_Vincent }
    ?targetEntity stardog:entityMatch ?match .
    ?match stardog:hasEntityMatchInfo [
        stardog:matchScore ?score ;
        stardog:entity1 ?entity1 ;
        stardog:entity2 ?entity2
    ]
    bind(if(?entity1 = ?targetEntity, ?entity2, 
         if(?entity2 = ?targetEntity, ?entity1,
            ?unbound)) as ?entity)
    filter(bound(?entity))
  
}
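The property path in the first query above, `stardog:entityMatch/^stardog:entityMatch`, follows the edge from an entity to its match node and then follows the same edge backwards to every entity in the group. The sketch below mimics that traversal over a tiny in-memory list of triples. The match node IRI `urn:example:match1` is hypothetical, and the data assumes that all three example entities ended up in one group.

```python
ENTITY_MATCH = "tag:stardog:api:entityMatch"
MATCH_NODE = "urn:example:match1"  # hypothetical IRI for the EntityMatch node

# (subject, predicate, object) triples: each entity points at its match node.
triples = [
    (":Sebestian_Vincent", ENTITY_MATCH, MATCH_NODE),
    (":Sebestn_Vincent", ENTITY_MATCH, MATCH_NODE),
    (":Sebestn_Vint", ENTITY_MATCH, MATCH_NODE),
]


def duplicates_of(target, triples):
    """Follow entityMatch forward, then backward, excluding the target itself."""
    # Forward step: match nodes the target entity is linked to.
    matches = {o for s, p, o in triples if s == target and p == ENTITY_MATCH}
    # Inverse step: every other entity linked to one of those match nodes.
    return sorted(
        {s for s, p, o in triples
         if p == ENTITY_MATCH and o in matches and s != target}
    )


if __name__ == "__main__":
    print(duplicates_of(":Sebestian_Vincent", triples))
```

The final `s != target` condition plays the role of the `filter(?entity != ?targetEntity)` clause in the query.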

Metadata

Entity resolution also generates metadata triples for each entity resolution run. The following SPARQL query can be used to fetch the metadata:


SELECT ?timestamp ?selectQuery ?erType ?userWhoRanER ?scoreIncluded ?config {
    ?entity a stardog:EntityMatchMetadata ;
        stardog:timestamp ?timestamp ;
        stardog:query ?selectQuery ;
        stardog:type ?erType ;
        stardog:user ?userWhoRanER ;
        stardog:score ?scoreIncluded ;
        stardog:config ?config
}