Entity Extraction and Linking
This page discusses entity extraction and linking in Stardog using BITES. It distinguishes between entity extractors and RDF extractors: the former extract entities from text, while the latter extract RDF from text documents and may use entity extractors underneath.
Using Entity Extractors
Configuration
By default, Stardog uses the tika RDF extractor. You can modify the database configuration option docs.default.rdf.extractors to use a different built-in extractor, such as text, entities, linker, or dictionary. A table containing descriptions of these extractors is provided in the Unstructured Content section.
If you’ve manually added CoreNLP extractors (discussed below), then CoreNLPMentionRDFExtractor, CoreNLPEntityLinkerRDFExtractor, and CoreNLPRelationRDFExtractor will also be available as values for docs.default.rdf.extractors.
OpenNLP
BITES, by default, uses the tika RDF extractor, which only extracts metadata from documents. Stardog can be configured to use the OpenNLP library to detect named entities mentioned in documents and optionally link those mentions to existing resources in the database.
Stardog can also be configured to use Stanford’s CoreNLP library for entity extraction, linking, and relationship extraction. More information about their configuration is available in the bites-corenlp repository.
The first step in using entity extractors is to identify the set of OpenNLP models that will be used. The following models are always required:
- A tokenizer and sentence detector. OpenNLP provides models for several languages (e.g., en-token.bin and en-sent.bin).
- At least one name finder model. Stardog supports both dictionary-based and custom-trained models. OpenNLP provides models for several types of entities and languages (e.g., en-ner-person.bin). We provide our own name finder models created from Wikipedia and DBPedia, which provide high recall / low precision in identifying Person, Organization, and Location types from English language documents.
All these files should be placed in the same directory, and the database configuration option docs.opennlp.models.path should be set to that directory's location, either at database creation time or afterwards.
For example, suppose you have a folder /data/stardog/opennlp with files en-token.bin, en-sent.bin, and en-ner-person.bin. The database creation CLI command would be as follows:
$ stardog-admin db create -o docs.opennlp.models.path=/data/stardog/opennlp -n movies
For consistency, model filenames should follow specific patterns:
- *-token.bin for tokenizers (e.g., en-token.bin)
- *-sent.bin for sentence detectors (e.g., en-sent.bin)
- *-ner-*.dict for dictionary-based name finders (e.g., dbpedia-en-ner-person.dict)
- *-ner-*.bin for custom-trained name finders (e.g., wikipedia-en-ner-organization.bin)
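The naming patterns above can be checked mechanically before pointing docs.opennlp.models.path at a directory. The following is a minimal sketch using only the JDK; the class ModelFileNames and its classify method are hypothetical helpers for illustration and are not part of Stardog or OpenNLP.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not a Stardog API): classifies a model filename
// according to the naming patterns listed in this section.
final class ModelFileNames {
    private static final Pattern TOKEN = Pattern.compile(".*-token\\.bin");
    private static final Pattern SENT = Pattern.compile(".*-sent\\.bin");
    private static final Pattern NER_DICT = Pattern.compile(".*-ner-.*\\.dict");
    private static final Pattern NER_BIN = Pattern.compile(".*-ner-.*\\.bin");

    // Returns a human-readable role for the file, or "unrecognized".
    static String classify(String fileName) {
        if (TOKEN.matcher(fileName).matches()) return "tokenizer";
        if (SENT.matcher(fileName).matches()) return "sentence detector";
        if (NER_DICT.matcher(fileName).matches()) return "dictionary-based name finder";
        if (NER_BIN.matcher(fileName).matches()) return "custom-trained name finder";
        return "unrecognized";
    }

    public static void main(String[] args) {
        // Pass the filenames to check as command-line arguments.
        for (String name : args) {
            System.out.println(name + " -> " + classify(name));
        }
    }
}
```

Files reported as unrecognized are not necessarily wrong (for example, .linker dictionaries follow a different convention), so the output is best read as a naming check only.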
Entities
The entities extractor detects the mentions of named entities based on the configured models and creates RDF statements for those entities. When we are putting a document, we need to specify that we want to use a non-default extractor. We can use both the tika metadata extractor and the entities extractor at the same time.
Using the doc put CLI command:
$ stardog doc put --rdf-extractors tika,entities movies CastAwayReview.pdf
The result of entity extraction will be in a named graph where an auto-generated IRI is used for the entity:
<tag:stardog:api:docs:movies:CastAwayReview.pdf> {
    <tag:stardog:api:docs:entity:9ad311b4-ddf8-4da2-a49f-3fa8f79813c2> rdfs:label "Wilson" .
    <tag:stardog:api:docs:entity:0d25b4ed-9cd4-4e00-ac3d-f984012b67f5> rdfs:label "Tom Hanks" .
    <tag:stardog:api:docs:entity:e559b828-714f-407d-aa73-7bdc39ee8014> rdfs:label "Robert Zemeckis" .
}
Linker
The linker extractor performs the same task as entities, but after the entities are extracted, it links those entities to the existing resources in the database. Linking is done by matching the mention text with the identifier and labels of existing resources in the database. This extractor requires full text search to be enabled to find the matching candidates and uses string similarity metrics to choose the best match. The commonly used properties for labels are supported: rdfs:label, foaf:name, dc:title, skos:prefLabel, and skos:altLabel.
Using the doc put CLI command:
$ stardog doc put --rdf-extractors linker movies CastAwayReview.pdf
The extraction results of linker will be similar to those of entities, but will contain only the entities for which a link to an existing resource was found. The link is available through the dc:references property.
<tag:stardog:api:docs:movies:CastAwayReview.pdf> {
    <tag:stardog:api:docs:entity:0d25b4ed-9cd4-4e00-ac3d-f984012b67f5> rdfs:label "Tom Hanks" ;
        <http://purl.org/dc/terms/references> <http://www.imdb.com/name/nm0000158> .
    <tag:stardog:api:docs:entity:e559b828-714f-407d-aa73-7bdc39ee8014> rdfs:label "Robert Zemeckis" ;
        <http://purl.org/dc/terms/references> <http://www.imdb.com/name/nm0000709> .
}
Dictionary
The dictionary extractor fulfills the same purpose as linker, but instead of heuristically trying to match a mention’s text with existing resources, it uses a user-defined dictionary to perform that task. The dictionary provides a set of mappings between text and IRIs. Each mention found in the document is looked up in the dictionary and, if found, the IRIs are added as dc:references links.
Dictionaries are .linker files, which need to be available in the folder determined by the database configuration property docs.opennlp.models.path. Stardog provides several dictionaries created from Wikipedia and DBPedia, which allow users to automatically link entity mentions to IRIs in those knowledge bases.
$ stardog doc put --rdf-extractors dictionary movies CastAwayReview.pdf
When using the dictionary option, all .linker files in the docs.opennlp.models.path folder will be used. The output follows the same syntax as linker.
<tag:stardog:api:docs:movies:CastAwayReview.pdf> {
    <tag:stardog:api:docs:entity:0d25b4ed-9cd4-4e00-ac3d-f984012b67f5> rdfs:label "Tom Hanks" ;
        <http://purl.org/dc/terms/references> <http://en.wikipedia.org/wiki/Tom_Hanks> ;
        <http://purl.org/dc/terms/references> <http://dbpedia.org/resource/Tom_Hanks> .
}
User-defined dictionaries can be created programmatically. For example, the Java class below will create a dictionary that links every mention of Tom Hanks to two IRIs.
import java.io.File;
import java.io.IOException;

import com.complexible.stardog.docs.nlp.impl.DictionaryLinker;
import com.google.common.collect.ImmutableMultimap;
import com.stardog.stark.model.IRI;

import static com.stardog.stark.Values.iri;

public class CreateLinker {
    public static void main(String[] args) throws IOException {
        // Map each mention text to the IRIs it should be linked to
        ImmutableMultimap<String, IRI> aDictionary = ImmutableMultimap.<String, IRI>builder()
            .putAll("Tom Hanks", iri("https://en.wikipedia.org/wiki/Tom_Hanks"), iri("http://www.imdb.com/name/nm0000158"))
            .build();

        // Write the dictionary as a .linker file into the models directory
        DictionaryLinker.Linker aLinker = new DictionaryLinker.Linker(aDictionary);
        aLinker.to(new File("/data/stardog/opennlp/TomHanks.linker"));
    }
}
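When the mappings come from a file rather than from literals in code, they first need to be grouped by mention text. The sketch below shows one way to do that with only the JDK, assuming a hypothetical tab-separated input with one text and one IRI per line. That file format and the class DictionaryEntries are illustrations only and are not defined by Stardog.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: groups "text<TAB>IRI" lines by mention text.
final class DictionaryEntries {
    static Map<String, List<String>> group(List<String> lines) {
        Map<String, List<String>> entries = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] parts = line.split("\t", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected text<TAB>IRI but got: " + line);
            }
            // Collect every IRI listed for the same mention text
            entries.computeIfAbsent(parts[0].trim(), k -> new ArrayList<>())
                   .add(parts[1].trim());
        }
        return entries;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "Tom Hanks\thttps://en.wikipedia.org/wiki/Tom_Hanks",
            "Tom Hanks\thttp://www.imdb.com/name/nm0000158");
        group(sample).forEach((text, iris) -> System.out.println(text + " -> " + iris));
    }
}
```

Each key and its list of IRI strings could then be added to the ImmutableMultimap builder used in CreateLinker, wrapping each string with iri(...).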
SPARQL
The entities, linker, and dictionary extractors are also available as a SPARQL service, which makes them applicable to any data in the graph, whether stored directly in Stardog or accessed remotely on SPARQL endpoints or virtual graphs.
prefix docs: <tag:stardog:api:docs:>
select * {
    ?review :content ?text
    service docs:entityExtractor {
        [] docs:text ?text ;
           docs:mention ?mention
    }
}
The entities extractor is accessed by using the docs:entityExtractor service, which receives one input argument, docs:text, with the text to be analyzed. The output will be the extracted named entity mentions, bound to the variable given in the docs:mention property.
+-----------------------------------------------------------------------------------+------------------+---------------+
| text | mention | review |
+-----------------------------------------------------------------------------------+------------------+---------------+
| "Directed by Robert Zemeckis, featuring Tom Hanks and a volleyball called Wilson" | "Robert Zemeckis"| :MovieReview |
| "Directed by Robert Zemeckis, featuring Tom Hanks and a volleyball called Wilson" | "Tom Hanks" | :MovieReview |
| "Directed by Robert Zemeckis, featuring Tom Hanks and a volleyball called Wilson" | "Wilson" | :MovieReview |
+-----------------------------------------------------------------------------------+------------------+---------------+
By adding an extra output variable, docs:entity, the linker extractor will be used instead.
prefix docs: <tag:stardog:api:docs:>
select * {
    ?review :content ?text
    service docs:entityExtractor {
        [] docs:text ?text ;
           docs:mention ?mention ;
           docs:entity ?entity
    }
}
+-------------------------+------------------+----------------+---------------+
| text | mention | entity | review |
+-------------------------+------------------+----------------+---------------+
| "Directed by Robert..." | "Tom Hanks" | imdb:nm0000158 | :MovieReview |
| "Directed by Robert..." | "Robert Zemeckis"| imdb:nm0000709 | :MovieReview |
+-------------------------+------------------+----------------+---------------+
The dictionary extractor is called in a similar way to linker, with an extra argument docs:mode set to docs:Dictionary.
prefix docs: <tag:stardog:api:docs:>
select * {
    ?review :content ?text
    service docs:entityExtractor {
        [] docs:text ?text ;
           docs:mention ?mention ;
           docs:entity ?entity ;
           docs:mode docs:Dictionary
    }
}
+-------------------------+------------------+---------------------+---------------+
| text | mention | entity | review |
+-------------------------+------------------+---------------------+---------------+
| "Directed by Robert..." | "Tom Hanks" | imdb:nm0000158 | :MovieReview |
| "Directed by Robert..." | "Tom Hanks" | wikipedia:Tom_Hanks | :MovieReview |
+-------------------------+------------------+---------------------+---------------+
All extractors accept one more output variable, docs:type, which will output the type of entity (e.g., Person, Organization), when available.
prefix docs: <tag:stardog:api:docs:>
select * {
    ?review :content ?text
    service docs:entityExtractor {
        [] docs:text ?text ;
           docs:mention ?mention ;
           docs:entity ?entity ;
           docs:type ?type
    }
}
+-------------------------+------------------+----------------+-----------+---------------+
| text | mention | entity | type | review |
+-------------------------+------------------+----------------+-----------+---------------+
| "Directed by Robert..." | "Tom Hanks" | imdb:nm0000158 | :Person | :MovieReview |
| "Directed by Robert..." | "Robert Zemeckis"| imdb:nm0000709 | :Person | :MovieReview |
+-------------------------+------------------+----------------+-----------+---------------+
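The queries in this section can also be sent from a program over the standard SPARQL protocol. The sketch below builds such a request with the JDK HTTP client (Java 11 or later). Several details are assumptions rather than facts from this page: the endpoint URL (a local server on port 5820 with a database named movies), the admin/admin credentials, and the class name EntityExtractorQuery. The empty prefix in :content is assumed to be resolvable by the database, as in the queries above.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical client sketch: posts a SPARQL query as a form-encoded request.
final class EntityExtractorQuery {
    // Assumed endpoint; adjust host, port, and database name as needed.
    static final String ENDPOINT = "http://localhost:5820/movies/query";

    static final String QUERY =
        "prefix docs: <tag:stardog:api:docs:>\n"
        + "select * {\n"
        + "  ?review :content ?text\n"
        + "  service docs:entityExtractor {\n"
        + "    [] docs:text ?text ;\n"
        + "       docs:mention ?mention\n"
        + "  }\n"
        + "}";

    // Builds a SPARQL protocol POST request with HTTP basic authentication.
    static HttpRequest buildRequest(String endpoint, String sparql, String user, String password) {
        String form = "query=" + URLEncoder.encode(sparql, StandardCharsets.UTF_8);
        String token = Base64.getEncoder()
            .encodeToString((user + ":" + password).getBytes(StandardCharsets.UTF_8));
        return HttpRequest.newBuilder(URI.create(endpoint))
            .header("Content-Type", "application/x-www-form-urlencoded")
            .header("Accept", "application/sparql-results+json")
            .header("Authorization", "Basic " + token)
            .POST(HttpRequest.BodyPublishers.ofString(form))
            .build();
    }

    public static void main(String[] args) throws Exception {
        // Assumed credentials; replace with real ones.
        HttpRequest request = buildRequest(ENDPOINT, QUERY, "admin", "admin");
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

The response body is the result set in SPARQL JSON format, carrying the same bindings shown in the tables above.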
Custom Entity Extractors
In addition to the built-in extractors, Stardog also supports custom entity extractors (analogously to custom RDF extractors). The docs:entityExtractor service accepts additional triple patterns to specify and configure the required extractor:
prefix docs: <tag:stardog:api:docs:>
select * {
    ?review :content ?text
    service docs:entityExtractor {
        [] docs:text ?text ;
           docs:mention ?mention ;
           docs:extractor <urn:user:defined:extractor> ;
           docs:ud:config "configuration"
    }
}
The docs:extractor predicate specifies the URI of a registered extractor. To register an extractor, one adds a factory class by subclassing EntityExtractorFactory. The factory specifies the extractor’s URI and the name (used in query plans) via its methods. It is then registered with the Java ServiceLoader by putting its class name in a file called com.complexible.stardog.docs.nlp.EntityExtractorFactory inside the META-INF/services directory in Stardog’s classpath. The jar with the extractor and its factory should also be in the classpath.
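Because registration depends on the services file being visible on the classpath, a small diagnostic can help when a custom extractor is not picked up. The sketch below uses only the JDK and reports the class names declared in any matching services file. The class ListRegisteredFactories is hypothetical, and the check reads the files only: it does not load or validate the factory classes themselves.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

// Hypothetical diagnostic: lists provider class names declared for a service.
final class ListRegisteredFactories {
    static final String SERVICE_FILE =
        "META-INF/services/com.complexible.stardog.docs.nlp.EntityExtractorFactory";

    static List<String> providerNames(ClassLoader loader, String resource) {
        List<String> names = new ArrayList<>();
        try {
            Enumeration<URL> files = loader.getResources(resource);
            while (files.hasMoreElements()) {
                try (BufferedReader in = new BufferedReader(new InputStreamReader(
                        files.nextElement().openStream(), StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        // ServiceLoader files allow '#' comments and blank lines
                        int hash = line.indexOf('#');
                        if (hash >= 0) line = line.substring(0, hash);
                        line = line.trim();
                        if (!line.isEmpty()) names.add(line);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        ClassLoader loader = ListRegisteredFactories.class.getClassLoader();
        List<String> names = providerNames(loader, SERVICE_FILE);
        if (names.isEmpty()) {
            System.out.println("No factory registrations found on this classpath");
        }
        names.forEach(System.out::println);
    }
}
```

For the result to be meaningful, the program has to run with the same classpath entries that Stardog uses.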
The custom extractor can be configured when invoked in a query. Its factory will get the SPARQL pattern specified inside the service and can read any required information before instantiating the extractor. In the example above, the docs:ud:config predicate is used to configure the extractor.
Recommended Reading
Learn more about entity linking and extraction in Stardog in the following blog posts: