Entity Extraction and Linking
This page discusses entity extraction and linking in Stardog using BITES.
Using Entity Extractors
Configuration
By default, Stardog uses the tika RDF extractor. You can set the database configuration option docs.default.rdf.extractors to use a different built-in extractor, such as text, entities, linker, or dictionary. A table describing these extractors is provided in the Unstructured Content section.
If you have manually added the CoreNLP extractors discussed just below, then CoreNLPMentionRDFExtractor, CoreNLPEntityLinkerRDFExtractor, and CoreNLPRelationRDFExtractor will also be available as values for this database configuration option.
OpenNLP
BITES, by default, uses the tika RDF extractor, which only extracts metadata from documents. Stardog can be configured to use the OpenNLP library to detect named entities mentioned in documents and optionally link those mentions to existing resources in the database.
Stardog can also be configured to use Stanford’s CoreNLP library for entity extraction, linking, and relationship extraction. More information about their configuration is available in the bites-corenlp repository.
The first step in using entity extractors is to identify the set of OpenNLP models that will be used. The following models are always required:
- A tokenizer and a sentence detector. OpenNLP provides models for several languages (e.g., en-token.bin and en-sent.bin).
- At least one name finder model. Stardog supports both dictionary-based and custom trained models. OpenNLP provides models for several types of entities and languages (e.g., en-ner-person.bin). We provide our own name finder models created from Wikipedia and DBPedia, which provide high recall / low precision in identifying Person, Organization, and Location types from English language documents.
All these files should be put in the same directory, and the database configuration option docs.opennlp.models.path should be set to that directory's location, either at database creation time or afterwards.
For example, suppose you have a folder /data/stardog/opennlp containing the files en-token.bin, en-sent.bin, and en-ner-person.bin. The database creation CLI command would be as follows:
$ stardog-admin db create -o docs.opennlp.models.path=/data/stardog/opennlp -n movies
For consistency, model filenames should follow specific patterns:
- *-token.bin for tokenizers (e.g., en-token.bin)
- *-sent.bin for sentence detectors (e.g., en-sent.bin)
- *-ner-*.dict for dictionary-based name finders (e.g., dbpedia-en-ner-person.dict)
- *-ner-*.bin for custom trained name finders (e.g., wikipedia-en-ner-organization.bin)
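As a quick sanity check before pointing docs.opennlp.models.path at a folder, the naming patterns above can be expressed as regular expressions. The following is a minimal sketch using only the Java standard library. The class name ModelFileKind and the kind labels are hypothetical, and it does not reflect how Stardog itself loads models.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Illustrative only: classifies model file names by the patterns listed above. */
class ModelFileKind {
    // Insertion order is preserved, so the name finder patterns are tried first.
    private static final Map<String, Pattern> KINDS = new LinkedHashMap<>();
    static {
        KINDS.put("dictionary name finder", Pattern.compile(".*-ner-.*\\.dict"));
        KINDS.put("trained name finder", Pattern.compile(".*-ner-.*\\.bin"));
        KINDS.put("tokenizer", Pattern.compile(".*-token\\.bin"));
        KINDS.put("sentence detector", Pattern.compile(".*-sent\\.bin"));
    }

    /** Returns the kind label for a file name, or "unrecognized" if no pattern matches. */
    static String classify(String fileName) {
        for (Map.Entry<String, Pattern> kind : KINDS.entrySet()) {
            if (kind.getValue().matcher(fileName).matches()) {
                return kind.getKey();
            }
        }
        return "unrecognized";
    }

    public static void main(String[] args) {
        String[] files = {
            "en-token.bin", "en-sent.bin",
            "dbpedia-en-ner-person.dict", "wikipedia-en-ner-organization.bin"
        };
        for (String file : files) {
            System.out.println(file + " -> " + classify(file));
        }
    }
}
```

A file that comes back as "unrecognized" is a hint that it may need renaming to match one of the patterns.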
Entities
The entities extractor detects mentions of named entities based on the configured models and creates RDF statements for those entities. When putting a document, we need to specify that we want to use a non-default extractor. We can use both the tika metadata extractor and the entities extractor at the same time.
Using the doc put CLI command:
$ stardog doc put --rdf-extractors tika,entities movies CastAwayReview.pdf
The result of entity extraction will be in a named graph where an auto-generated IRI is used for the entity:
<tag:stardog:api:docs:movies:CastAwayReview.pdf> {
    <tag:stardog:api:docs:entity:9ad311b4-ddf8-4da2-a49f-3fa8f79813c2> rdfs:label "Wilson" .
    <tag:stardog:api:docs:entity:0d25b4ed-9cd4-4e00-ac3d-f984012b67f5> rdfs:label "Tom Hanks" .
    <tag:stardog:api:docs:entity:e559b828-714f-407d-aa73-7bdc39ee8014> rdfs:label "Robert Zemeckis" .
}
Linker
The linker extractor performs the same task as entities, but after the entities are extracted it links them to existing resources in the database. Linking is done by matching the mention text against the identifiers and labels of existing resources in the database. This extractor requires full-text search to be enabled to find matching candidates and uses string similarity metrics to choose the best match. The commonly used label properties are supported: rdfs:label, foaf:name, dc:title, skos:prefLabel, and skos:altLabel.
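To make the "string similarity" step concrete, here is a minimal sketch of picking the best candidate label for a mention. It assumes a normalized Levenshtein similarity and a fixed threshold. Both are illustrative choices, since the text above does not say which metrics Stardog uses, and the class and method names are hypothetical. Candidate retrieval through full-text search is not modeled: the candidates are simply passed in as a label-to-IRI map.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Illustrative only: links a mention to the most similar candidate label. */
class MentionLinkerSketch {
    /** Levenshtein edit distance, computed with two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    /** Case-insensitive similarity in [0, 1]; 1.0 means the strings are equal. */
    static double similarity(String a, String b) {
        String x = a.toLowerCase(Locale.ROOT);
        String y = b.toLowerCase(Locale.ROOT);
        int longest = Math.max(x.length(), y.length());
        return longest == 0 ? 1.0 : 1.0 - (double) editDistance(x, y) / longest;
    }

    /** Returns the IRI of the best-scoring label, if any label reaches the threshold. */
    static Optional<String> bestMatch(String mention, Map<String, String> labelToIri, double threshold) {
        String bestIri = null;
        double bestScore = threshold;
        for (Map.Entry<String, String> candidate : labelToIri.entrySet()) {
            double score = similarity(mention, candidate.getKey());
            if (score >= bestScore) {
                bestScore = score;
                bestIri = candidate.getValue();
            }
        }
        return Optional.ofNullable(bestIri);
    }

    public static void main(String[] args) {
        Map<String, String> candidates = new LinkedHashMap<>();
        candidates.put("Tom Hanks", "http://www.imdb.com/name/nm0000158");
        candidates.put("Robert Zemeckis", "http://www.imdb.com/name/nm0000709");
        System.out.println(bestMatch("Tom Hanks", candidates, 0.8));
        System.out.println(bestMatch("Wilson", candidates, 0.8));
    }
}
```

In this sketch a mention with no sufficiently similar label yields an empty result, which mirrors the behavior that unlinked mentions do not appear in the linker output.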
Using the doc put CLI command:
$ stardog doc put --rdf-extractors linker movies CastAwayReview.pdf
The extraction results of linker will be similar to those of entities, but will only contain entities for which a link to an existing resource was found. The link is available through the dc:references property.
<tag:stardog:api:docs:movies:CastAwayReview.pdf> {
    <tag:stardog:api:docs:entity:0d25b4ed-9cd4-4e00-ac3d-f984012b67f5> rdfs:label "Tom Hanks" ;
        <http://purl.org/dc/terms/references> <http://www.imdb.com/name/nm0000158> .
    <tag:stardog:api:docs:entity:e559b828-714f-407d-aa73-7bdc39ee8014> rdfs:label "Robert Zemeckis" ;
        <http://purl.org/dc/terms/references> <http://www.imdb.com/name/nm0000709> .
}
Dictionary
The dictionary extractor fulfills the same purpose as the linker, but instead of heuristically trying to match a mention's text with existing resources, it uses a user-defined dictionary to perform that task. The dictionary provides a set of mappings between text and IRIs. Each mention found in the document is looked up in the dictionary and, if found, the IRIs are added as dc:references links.
Dictionaries are .linker files, which need to be available in the folder specified by the database configuration property docs.opennlp.models.path. Stardog provides several dictionaries created from Wikipedia and DBPedia, which allow users to automatically link entity mentions to IRIs in those knowledge bases.
$ stardog doc put --rdf-extractors dictionary movies CastAwayReview.pdf
When using the dictionary option, all .linker files in the docs.opennlp.models.path folder will be used. The output follows the same syntax as the linker.
<tag:stardog:api:docs:movies:CastAwayReview.pdf> {
    <tag:stardog:api:docs:entity:0d25b4ed-9cd4-4e00-ac3d-f984012b67f5> rdfs:label "Tom Hanks" ;
        <http://purl.org/dc/terms/references> <http://en.wikipedia.org/wiki/Tom_Hanks> ;
        <http://purl.org/dc/terms/references> <http://dbpedia.org/resource/Tom_Hanks> .
}
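Conceptually, the dictionary step is a lookup from mention text to zero or more IRIs. The sketch below models that with a plain Java map and assumes exact, case-sensitive matching on the mention text, which may differ from Stardog's actual behavior. The class and method names are hypothetical and no Stardog API is used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: turns dictionary hits into (mention, dc:references, IRI) triples. */
class DictionaryLookupSketch {
    static final String DC_REFERENCES = "http://purl.org/dc/terms/references";

    /** Emits one triple per IRI mapped to each mention; unknown mentions are skipped. */
    static List<String[]> link(List<String> mentions, Map<String, List<String>> dictionary) {
        List<String[]> links = new ArrayList<>();
        for (String mention : mentions) {
            List<String> iris = dictionary.getOrDefault(mention, Collections.emptyList());
            for (String iri : iris) {
                links.add(new String[] {mention, DC_REFERENCES, iri});
            }
        }
        return links;
    }

    public static void main(String[] args) {
        Map<String, List<String>> dictionary = new HashMap<>();
        dictionary.put("Tom Hanks", Arrays.asList(
            "http://en.wikipedia.org/wiki/Tom_Hanks",
            "http://dbpedia.org/resource/Tom_Hanks"));

        // "Wilson" has no dictionary entry, so it produces no links.
        for (String[] triple : link(Arrays.asList("Tom Hanks", "Wilson"), dictionary)) {
            System.out.println(String.join(" ", triple));
        }
    }
}
```

Because one text key can map to several IRIs, a single mention can produce multiple dc:references links, as in the Tom Hanks output above.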
User-defined dictionaries can be created programmatically. For example, the Java class below will create a dictionary that links every mention of Tom Hanks to two IRIs.
import java.io.File;
import java.io.IOException;

import com.complexible.stardog.docs.nlp.impl.DictionaryLinker;
import com.google.common.collect.ImmutableMultimap;
import com.stardog.stark.model.IRI;

import static com.stardog.stark.Values.iri;

public class CreateLinker {
    public static void main(String[] args) throws IOException {
        // Map the mention text "Tom Hanks" to two IRIs
        ImmutableMultimap<String, IRI> aDictionary = ImmutableMultimap.<String, IRI>builder()
            .putAll("Tom Hanks", iri("https://en.wikipedia.org/wiki/Tom_Hanks"), iri("http://www.imdb.com/name/nm0000158"))
            .build();

        // Serialize the dictionary as a .linker file in the models folder
        DictionaryLinker.Linker aLinker = new DictionaryLinker.Linker(aDictionary);
        aLinker.to(new File("/data/stardog/opennlp/TomHanks.linker"));
    }
}
SPARQL
The entities, linker, and dictionary extractors are also available as a SPARQL service, which makes them applicable to any data in the graph, whether stored directly in Stardog or accessed remotely on SPARQL endpoints or virtual graphs.
prefix docs: <tag:stardog:api:docs:>
select * {
    ?review :content ?text
    service docs:entityExtractor {
        [] docs:text ?text ;
           docs:mention ?mention
    }
}
The entities extractor is accessed by using the docs:entityExtractor service, which receives one input argument, docs:text, with the text to be analyzed. The output will be the extracted named entity mentions, bound to the variable given in the docs:mention property.
+-----------------------------------------------------------------------------------+------------------+---------------+
| text | mention | review |
+-----------------------------------------------------------------------------------+------------------+---------------+
| "Directed by Robert Zemeckis, featuring Tom Hanks and a volleyball called Wilson" | "Robert Zemeckis"| :MovieReview |
| "Directed by Robert Zemeckis, featuring Tom Hanks and a volleyball called Wilson" | "Tom Hanks" | :MovieReview |
| "Directed by Robert Zemeckis, featuring Tom Hanks and a volleyball called Wilson" | "Wilson" | :MovieReview |
+-----------------------------------------------------------------------------------+------------------+---------------+
By adding an extra output variable, docs:entity, the linker extractor will be used instead.
prefix docs: <tag:stardog:api:docs:>
select * {
    ?review :content ?text
    service docs:entityExtractor {
        [] docs:text ?text ;
           docs:mention ?mention ;
           docs:entity ?entity
    }
}
+-------------------------+------------------+----------------+---------------+
| text | mention | entity | review |
+-------------------------+------------------+----------------+---------------+
| "Directed by Robert..." | "Tom Hanks" | imdb:nm0000158 | :MovieReview |
| "Directed by Robert..." | "Robert Zemeckis"| imdb:nm0000709 | :MovieReview |
+-------------------------+------------------+----------------+---------------+
The dictionary extractor is called in a similar way to linker, with an extra argument docs:mode set to docs:Dictionary.
prefix docs: <tag:stardog:api:docs:>
select * {
    ?review :content ?text
    service docs:entityExtractor {
        [] docs:text ?text ;
           docs:mention ?mention ;
           docs:entity ?entity ;
           docs:mode docs:Dictionary
    }
}
+-------------------------+------------------+---------------------+---------------+
| text | mention | entity | review |
+-------------------------+------------------+---------------------+---------------+
| "Directed by Robert..." | "Tom Hanks" | imdb:nm0000158 | :MovieReview |
| "Directed by Robert..." | "Tom Hanks" | wikipedia:Tom_Hanks | :MovieReview |
+-------------------------+------------------+---------------------+---------------+
All extractors accept one more output variable, docs:type, which will output the type of entity (e.g., Person, Organization), when available.
prefix docs: <tag:stardog:api:docs:>
select * {
    ?review :content ?text
    service docs:entityExtractor {
        [] docs:text ?text ;
           docs:mention ?mention ;
           docs:entity ?entity ;
           docs:type ?type
    }
}
+-------------------------+------------------+----------------+-----------+---------------+
| text | mention | entity | type | review |
+-------------------------+------------------+----------------+-----------+---------------+
| "Directed by Robert..." | "Tom Hanks" | imdb:nm0000158 | :Person | :MovieReview |
| "Directed by Robert..." | "Robert Zemeckis"| imdb:nm0000709 | :Person | :MovieReview |
+-------------------------+------------------+----------------+-----------+---------------+
Custom Extractors
The included extractors are intentionally basic, especially when compared to machine learning or text mining algorithms. A custom extractor connects the document store to algorithms tailored specifically to your data. The extractor SPI allows integration of any arbitrary workflow or algorithm from NLP methods like part-of-speech tagging, entity recognition, relationship learning, or sentiment analysis to machine learning models such as document ranking and clustering.
Extracted RDF assertions are stored in a named graph specific to the document, allowing provenance tracking and versatile querying. The extractor must implement the RDFExtractor interface. A convenience class, TextProvidingRDFExtractor, extracts the text from the document before calling the extractor. In addition, AbstractEntityRDFExtractor (or one of its existing subclasses) extends TextProvidingRDFExtractor so you can customize entity linking extraction to your specific needs.
The text extractor SPI gives you the opportunity to support arbitrary document formats. Implementations are given a raw document and are expected to extract a string of text, which is added to the full-text search index. Text extractors should implement the TextProvidingRDFExtractor interface.
Custom extractors are registered with the Java ServiceLoader under the RDFExtractor or TextProvidingRDFExtractor class names. Custom extractors can be referred to from the command line or APIs by their fully qualified or "simple" class names.
For an example of a custom extractor, see our GitHub repository.
Recommended Reading:
The following blog posts can aid comprehension around entity linking and extraction in Stardog: