Entity Resolution
This chapter discusses Entity Resolution, a Stardog feature that resolves real-world entities using fuzzy matching. This feature is in Beta. This page explains what Entity Resolution is, how it works, and which operations are supported.
Overview
Stardog supports operations to match real-world entities (e.g., a person) using fuzzy matching techniques, an approach known as Entity Resolution, Record Linkage, or Data Matching. Entity Resolution identifies data records, in a single data source or across multiple data sources, that refer to the same real-world entity and links those records together. We recommend using the external compute functionality that the Stardog platform provides for entity resolution. In-memory entity resolution is supported only for smaller graphs, i.e., those with fewer than 100K triples, and is more appropriate for demo purposes.
Entity resolution is invoked through the entity-resolution resolve CLI command.
How it Works
To run Entity Resolution, Stardog requires the database name, a SELECT query, the name of the IRI field in the SELECT query results, and the target graph name. Stardog executes the provided SELECT query on the database to fetch the entities identified by the IRI field, runs entity resolution, and writes the results to the provided target graph. If an entity returned by the SELECT query has an attribute with multiple values, group the values and convert them into a single comma-separated value. For details, refer to the entity-resolution resolve CLI command.
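The grouping of multi-valued attributes can be done in the SELECT query itself with the standard SPARQL 1.1 GROUP_CONCAT aggregate. The following is a minimal sketch, not a prescribed input query. It assumes a hypothetical :Person class with :fname, :lname, :age, and :phone properties, a hypothetical namespace for the : prefix, and that only :phone can have multiple values. Here ?person would be the IRI field name.

```sparql
# Hypothetical namespace; replace with the one used by your data.
PREFIX : <http://example.org/>

# ?person is the IRI field. All :phone values of a person are collapsed
# into one comma-separated string, so each person yields a single row.
SELECT ?person ?fname ?lname ?age
       (GROUP_CONCAT(DISTINCT ?phone; separator=",") AS ?phones)
WHERE {
  ?person a :Person ;
          :fname ?fname ;
          :lname ?lname ;
          :age ?age ;
          :phone ?phone .
}
GROUP BY ?person ?fname ?lname ?age
```

If other attributes can also be multi-valued, they would need the same treatment; otherwise each distinct combination of values produces its own row for the same entity.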
Configuration
Entity resolution exposes some configuration options, which the user can pass as a properties file to the entity-resolution resolve CLI command.
Name | Description | Default Value |
---|---|---|
stardog.er.similarity.threshold | Two entities with a similarity score above this value are considered duplicates. The value must be in the range of 0 to 1 and affects the quality of the results. | 0.7 |
stardog.er.dataset.partition | Controls how the dataset is partitioned for the ER pipeline. This value is only valid for entity resolution using external compute, where it affects the performance of Entity Resolution on the external compute platform. Refer to spark-docs. | 8 |
Results
Entity resolution does not modify the existing data; it writes its results to the provided target graph. The results link the duplicate entities so users can easily query them. Users can then take appropriate action on them, e.g., merging duplicate entities by running an update query as a follow-on step. For each group of detected duplicate entities, entity resolution creates a new node of type tag:stardog:api:EntityMatch and links each entity in the group to this node with an edge of type tag:stardog:api:entityMatch.
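As one possible follow-on step, the sketch below records every pair of detected duplicates as an owl:sameAs link instead of physically merging them. It is an illustration under stated assumptions, not the recommended merge procedure: the graph names <urn:example:er-results> (the target graph passed to entity resolution) and <urn:example:sameas> are hypothetical, and the pattern assumes the entityMatch edge points from each entity to its EntityMatch node, as in the queries under Query Results.

```sparql
PREFIX stardog: <tag:stardog:api:>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

# Hypothetical graph names; substitute the target graph used for the run.
INSERT {
  GRAPH <urn:example:sameas> { ?a owl:sameAs ?b }
}
WHERE {
  GRAPH <urn:example:er-results> {
    # Two entities attached to the same EntityMatch node are duplicates.
    ?a stardog:entityMatch ?match .
    ?b stardog:entityMatch ?match .
  }
  # Emit each unordered pair once and skip self-pairs.
  FILTER (STR(?a) < STR(?b))
}
```

Writing the links to a separate graph keeps both the original data and the entity resolution output unchanged, so the step can be undone by dropping that graph.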
If the user includes the score in the output, a new node of type tag:stardog:api:EntityMatchInfo (the score node) is created for each pair of duplicates in the group to represent their similarity score. The tag:stardog:api:EntityMatch node, which is unique to the group of duplicates, is linked to each score node by an edge of type tag:stardog:api:hasEntityMatchInfo, and the two duplicate entities are linked to the score node by the edges Entity1 (tag:stardog:api:Entity1) and Entity2 (tag:stardog:api:Entity2).
Suppose the graph contains the following duplicate entities, all corresponding to a single person:
# sebetian1
:Sebestian_Vincent a :Person;
    :fname "Sebestian";
    :lname "Vincent";
    :age "24"^^xsd:integer;
    :phone "9871456789".

# sebetian2
:Sebestn_Vincent a :Person;
    :fname "Sebestn";
    :lname "Vincent";
    :age "2"^^xsd:integer;
    :phone "9871456789".

# sebetian3
:Sebestn_Vint a :Person;
    :fname "Sebestian";
    :lname "Vint";
    :age "24"^^xsd:integer;
    :phone "987145678".
Performing entity resolution on such a graph produces the output visualized in the following figure:
[Figure: entity resolution output without scores]
Performing entity resolution with the score option on such a graph produces the output visualized in the following figure:
[Figure: entity resolution output with scores]
Query Results
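The following SPARQL queries find the duplicates of a given entity. The first returns only the duplicates; the second also returns the similarity score of each pair and requires that entity resolution was run with the score option: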
select ?targetEntity ?entity {
    values ?targetEntity { :Sebestian_Vincent }
    ?targetEntity stardog:entityMatch/^stardog:entityMatch ?entity
    filter(?entity != ?targetEntity)
}

select ?targetEntity ?entity ?score {
    values ?targetEntity { :Sebestn_Vincent }
    ?targetEntity stardog:entityMatch ?match .
    ?match stardog:hasEntityMatchInfo [
        stardog:matchScore ?score ;
        stardog:entity1 ?entity1 ;
        stardog:entity2 ?entity2
    ]
    bind(if(?entity1 = ?targetEntity, ?entity2,
         if(?entity2 = ?targetEntity, ?entity1,
            ?unbound)) as ?entity)
    filter(bound(?entity))
}
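To review all detected groups at once rather than starting from a single entity, the same entityMatch edges can be aggregated per EntityMatch node. This is a minimal sketch using standard SPARQL 1.1 aggregates; like the queries above, it assumes the results are visible without an explicit GRAPH clause, which may not hold for every database configuration.

```sparql
PREFIX stardog: <tag:stardog:api:>

# One row per duplicate group: the EntityMatch node, the number of
# entities in the group, and their IRIs as a comma-separated string.
SELECT ?match
       (COUNT(DISTINCT ?entity) AS ?size)
       (GROUP_CONCAT(DISTINCT STR(?entity); separator=", ") AS ?members)
WHERE {
  ?entity stardog:entityMatch ?match .
}
GROUP BY ?match
ORDER BY DESC(?size)
```

Sorting by group size puts the largest groups first, which are often the ones worth inspecting before any merge step.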
Metadata
Entity resolution also generates metadata triples for each entity resolution run. The following SPARQL query can be used to fetch the metadata:
SELECT ?timestamp ?selectQuery ?erType ?userWhoRanER ?scoreIncluded ?config {
    ?entity a stardog:EntityMatchMetadata ;
        stardog:timestamp ?timestamp ;
        stardog:query ?selectQuery ;
        stardog:type ?erType ;
        stardog:user ?userWhoRanER ;
        stardog:score ?scoreIncluded ;
        stardog:config ?config
}