Entity Extraction and Linking

This page discusses entity extraction and linking in Stardog using BITES. It distinguishes between entity extractors and RDF extractors: the former extract entities from text, while the latter extract RDF from text documents and may use entity extractors underneath.

Page Contents

Using Entity Extractors
1. Configuration
2. OpenNLP
3. Entities
4. Linker
5. Dictionary
SPARQL
Custom Entity Extractors
Recommended Reading

Using Entity Extractors

Configuration

Stardog uses the tika RDF extractor by default. To use a different built-in extractor, such as text, entities, linker, or dictionary, change the database configuration option docs.default.rdf.extractors. The Unstructured Content section has a table describing these extractors.

If you’ve manually added CoreNLP extractors (discussed below), then CoreNLPMentionRDFExtractor, CoreNLPEntityLinkerRDFExtractor, and CoreNLPRelationRDFExtractor will also be available as values for docs.default.rdf.extractors.

OpenNLP

BITES, by default, uses the tika RDF extractor, which only extracts metadata from documents. Stardog can be configured to use the OpenNLP library to detect named entities mentioned in documents and optionally link those mentions to existing resources in the database.

Stardog can also be configured to use Stanford’s CoreNLP library for entity extraction, linking, and relationship extraction. More information about configuring the CoreNLP extractors is available in the bites-corenlp repository.

The first step to use entity extractors is to identify the set of OpenNLP models that will be used. The following models are always required:

A tokenizer and sentence detector. OpenNLP provides models for several languages (e.g., en-token.bin and en-sent.bin)
At least one name finder model. Stardog supports both dictionary-based and custom-trained models. OpenNLP provides models for several types of entities and languages (e.g., en-ner-person.bin). We provide our own name finder models created from Wikipedia and DBPedia, which provide high recall / low precision in identifying Person, Organization, and Location types from English language documents.

All of these files should be placed in the same directory, and the database configuration option docs.opennlp.models.path should be set to that directory, either when the database is created or afterwards.

For example, suppose you have a folder /data/stardog/opennlp with files en-token.bin, en-sent.bin, and en-ner-person.bin. The database creation CLI command would be as follows:

$ stardog-admin db create -o docs.opennlp.models.path=/data/stardog/opennlp -n movies

For consistency, model filenames should follow specific patterns:

*-token.bin for tokenizers (e.g., en-token.bin)
*-sent.bin for sentence detectors (e.g., en-sent.bin)
*-ner-*.dict for dictionary-based name finders (e.g., dbpedia-en-ner-person.dict)
*-ner-*.bin for custom-trained name finders (e.g., wikipedia-en-ner-organization.bin)
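
The naming patterns above can be checked before pointing docs.opennlp.models.path at a directory. The following is a minimal sketch that sorts file names into the four kinds of model; the class ModelFileKind and its method names are hypothetical and are not part of Stardog, and the regular expressions are simply a translation of the glob patterns listed above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Stardog: sorts model files by the naming patterns above. */
final class ModelFileKind {

    // Patterns are checked in insertion order; the first match wins.
    private static final Map<String, Pattern> KINDS = new LinkedHashMap<>();
    static {
        KINDS.put("tokenizer", Pattern.compile(".*-token\\.bin"));
        KINDS.put("sentence detector", Pattern.compile(".*-sent\\.bin"));
        KINDS.put("dictionary name finder", Pattern.compile(".*-ner-.*\\.dict"));
        KINDS.put("trained name finder", Pattern.compile(".*-ner-.*\\.bin"));
    }

    /** Returns the kind of model a file name denotes, or "unrecognized". */
    static String classify(String fileName) {
        for (Map.Entry<String, Pattern> kind : KINDS.entrySet()) {
            if (kind.getValue().matcher(fileName).matches()) {
                return kind.getKey();
            }
        }
        return "unrecognized";
    }

    public static void main(String[] args) {
        String[] files = {
            "en-token.bin", "en-sent.bin",
            "dbpedia-en-ner-person.dict", "wikipedia-en-ner-organization.bin",
            "notes.txt"
        };
        for (String file : files) {
            System.out.println(file + " -> " + classify(file));
        }
    }
}
```

A file reported as "unrecognized" does not follow any of the patterns listed above and is a candidate for renaming.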

Entities

The entities extractor detects mentions of named entities based on the configured models and creates RDF statements for those entities. Because it is not the default extractor, it must be requested explicitly when a document is added. The tika metadata extractor and the entities extractor can be used at the same time.

Using the doc put CLI command:

$ stardog doc put --rdf-extractors tika,entities movies CastAwayReview.pdf

The result of entity extraction is stored in a named graph, with an auto-generated IRI for each entity:

<tag:stardog:api:docs:movies:CastAwayReview.pdf> {
    <tag:stardog:api:docs:entity:9ad311b4-ddf8-4da2-a49f-3fa8f79813c2> rdfs:label "Wilson" .
    <tag:stardog:api:docs:entity:0d25b4ed-9cd4-4e00-ac3d-f984012b67f5> rdfs:label "Tom Hanks" .
    <tag:stardog:api:docs:entity:e559b828-714f-407d-aa73-7bdc39ee8014> rdfs:label "Robert Zemeckis" .
}

Linker

The linker extractor performs the same task as entities, but after the entities are extracted, it links those entities to the existing resources in the database. Linking is done by matching the mention text with the identifier and labels of existing resources in the database. This extractor requires full text search to be enabled to find the matching candidates and uses string similarity metrics to choose the best match. The commonly-used properties for labels are supported: rdfs:label, foaf:name, dc:title, skos:prefLabel and skos:altLabel.

Using the doc put CLI command:

$ stardog doc put --rdf-extractors linker movies CastAwayReview.pdf

The extraction results of linker are similar to those of entities, but contain only the mentions for which a matching existing resource was found. The link is available through the dc:references property.

<tag:stardog:api:docs:movies:CastAwayReview.pdf> {

    <tag:stardog:api:docs:entity:0d25b4ed-9cd4-4e00-ac3d-f984012b67f5> rdfs:label "Tom Hanks" ;
        <http://purl.org/dc/terms/references> <http://www.imdb.com/name/nm0000158> .

    <tag:stardog:api:docs:entity:e559b828-714f-407d-aa73-7bdc39ee8014> rdfs:label "Robert Zemeckis" ;
        <http://purl.org/dc/terms/references> <http://www.imdb.com/name/nm0000709> .
}
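
To make the matching step more concrete, here is a minimal sketch of choosing the best candidate by string similarity. It is an illustration only: this page does not say which similarity metric or threshold Stardog uses, so the normalized Levenshtein similarity, the 0.8 threshold, and the class BestLabelMatch are all assumptions, and the candidate map stands in for whatever full text search would return.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Illustration only: the metric and threshold are assumptions, not Stardog's implementation. */
final class BestLabelMatch {

    /** Classic dynamic-programming Levenshtein edit distance, keeping two rows at a time. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in [0, 1]; 1.0 means the strings are equal ignoring case. */
    static double similarity(String a, String b) {
        String x = a.toLowerCase();
        String y = b.toLowerCase();
        int longest = Math.max(x.length(), y.length());
        return longest == 0 ? 1.0 : 1.0 - (double) editDistance(x, y) / longest;
    }

    /** Returns the IRI whose label is most similar to the mention, if it reaches the threshold. */
    static Optional<String> link(String mention, Map<String, String> labelByIri, double threshold) {
        String bestIri = null;
        double bestScore = -1.0;
        for (Map.Entry<String, String> candidate : labelByIri.entrySet()) {
            double score = similarity(mention, candidate.getValue());
            if (score > bestScore) {
                bestScore = score;
                bestIri = candidate.getKey();
            }
        }
        return bestIri != null && bestScore >= threshold ? Optional.of(bestIri) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical candidates, as if returned by full text search over resource labels.
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("http://www.imdb.com/name/nm0000158", "Tom Hanks");
        labels.put("http://www.imdb.com/name/nm0000709", "Robert Zemeckis");
        System.out.println(link("Tom Hanks", labels, 0.8));
        System.out.println(link("Wilson", labels, 0.8));
    }
}
```

In this sketch a mention such as "Wilson" has no sufficiently similar label, so no link is produced, which mirrors why it is absent from the linker output above.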

Dictionary

The dictionary extractor fulfills the same purpose as linker, but instead of heuristically trying to match a mention’s text with existing resources, it uses a user-defined dictionary to perform that task. The dictionary provides a set of mappings between text and IRIs. Each mention found in the document is looked up in the dictionary and, if found, its IRIs are added as dc:references links.

Dictionaries are .linker files, which need to be available in the folder determined by the database configuration property docs.opennlp.models.path. Stardog provides several dictionaries created from Wikipedia and DBPedia, which allow users to automatically link entity mentions to IRIs in those knowledge bases.

$ stardog doc put --rdf-extractors dictionary movies CastAwayReview.pdf

When using the dictionary option, all .linker files in the docs.opennlp.models.path folder will be used. The output follows the same syntax as linker.

<tag:stardog:api:docs:movies:CastAwayReview.pdf> {

    <tag:stardog:api:docs:entity:0d25b4ed-9cd4-4e00-ac3d-f984012b67f5> rdfs:label "Tom Hanks" ;
        <http://purl.org/dc/terms/references> <http://en.wikipedia.org/wiki/Tom_Hanks> ;
        <http://purl.org/dc/terms/references> <http://dbpedia.org/resource/Tom_Hanks> .
}

User-defined dictionaries can be created programmatically. For example, the Java class below will create a dictionary that links every mention of Tom Hanks to two IRIs.

import java.io.File;
import java.io.IOException;
import com.complexible.stardog.docs.nlp.impl.DictionaryLinker;
import com.google.common.collect.ImmutableMultimap;
import com.stardog.stark.model.IRI;
import static com.stardog.stark.Values.iri;

public class CreateLinker {

    public static void main(String[] args) throws IOException {
        ImmutableMultimap<String, IRI> aDictionary = ImmutableMultimap.<String, IRI>builder()
                                                         .putAll("Tom Hanks", iri("https://en.wikipedia.org/wiki/Tom_Hanks"), iri("http://www.imdb.com/name/nm0000158"))
                                                         .build();

        DictionaryLinker.Linker aLinker = new DictionaryLinker.Linker(aDictionary);

        aLinker.to(new File("/data/stardog/opennlp/TomHanks.linker"));
    }
}
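
Conceptually, the lookup performed against such a mapping can be pictured with plain JDK collections. The sketch below is not Stardog's implementation and does not read .linker files; the class DictionaryLookupSketch is hypothetical, and exact, case-sensitive matching of the mention text is an assumption made to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustration only: models a dictionary as a map from mention text to IRIs. */
final class DictionaryLookupSketch {

    static final String DC_REFERENCES = "http://purl.org/dc/terms/references";

    /** Emits one "<entity> <dc:references> <iri> ." line per IRI mapped to the mention. */
    static List<String> referencesFor(String entityIri, String mention,
                                      Map<String, List<String>> dictionary) {
        List<String> triples = new ArrayList<>();
        // Assumption: exact match on the mention text; unknown mentions yield no links.
        for (String target : dictionary.getOrDefault(mention, List.of())) {
            triples.add("<" + entityIri + "> <" + DC_REFERENCES + "> <" + target + "> .");
        }
        return triples;
    }

    public static void main(String[] args) {
        Map<String, List<String>> dictionary = Map.of(
            "Tom Hanks", List.of("https://en.wikipedia.org/wiki/Tom_Hanks",
                                 "http://www.imdb.com/name/nm0000158"));
        String entity = "tag:stardog:api:docs:entity:0d25b4ed-9cd4-4e00-ac3d-f984012b67f5";
        referencesFor(entity, "Tom Hanks", dictionary).forEach(System.out::println);
        System.out.println(referencesFor(entity, "Wilson", dictionary).size());
    }
}
```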

SPARQL

The entities, linker, and dictionary extractors are also available as a SPARQL service, which makes them applicable to any data in the graph, whether stored directly in Stardog or accessed remotely on SPARQL endpoints or virtual graphs.

prefix docs: <tag:stardog:api:docs:>

select * {
  ?review :content ?text

  service docs:entityExtractor {
    []  docs:text ?text ;
        docs:mention ?mention
  }
}

The entities extractor is accessed by using the docs:entityExtractor service, which receives one input argument, docs:text, with the text to be analyzed. The output will be the extracted named entity mentions, bound to the variable given in the docs:mention property.

+-----------------------------------------------------------------------------------+------------------+---------------+
|                                       text                                        |      mention     |     review    |
+-----------------------------------------------------------------------------------+------------------+---------------+
| "Directed by Robert Zemeckis, featuring Tom Hanks and a volleyball called Wilson" | "Robert Zemeckis"| :MovieReview  |
| "Directed by Robert Zemeckis, featuring Tom Hanks and a volleyball called Wilson" | "Tom Hanks"      | :MovieReview  |
| "Directed by Robert Zemeckis, featuring Tom Hanks and a volleyball called Wilson" | "Wilson"         | :MovieReview  |
+-----------------------------------------------------------------------------------+------------------+---------------+

Adding an extra output variable, docs:entity, causes the linker extractor to be used instead.

prefix docs: <tag:stardog:api:docs:>

select * {
  ?review :content ?text

  service docs:entityExtractor {
    []  docs:text ?text ;
        docs:mention ?mention ;
        docs:entity ?entity
  }
}

+-------------------------+------------------+----------------+---------------+
|         text            |      mention     |     entity     |     review    |
+-------------------------+------------------+----------------+---------------+
| "Directed by Robert..." | "Tom Hanks"      | imdb:nm0000158 | :MovieReview  |
| "Directed by Robert..." | "Robert Zemeckis"| imdb:nm0000709 | :MovieReview  |
+-------------------------+------------------+----------------+---------------+

The dictionary extractor is called in a similar way to linker, with an extra argument docs:mode set to docs:Dictionary.

prefix docs: <tag:stardog:api:docs:>

select * {
  ?review :content ?text

  service docs:entityExtractor {
    []  docs:text ?text ;
        docs:mention ?mention ;
        docs:entity ?entity ;
        docs:mode docs:Dictionary
  }
}

+-------------------------+------------------+---------------------+---------------+
|         text            |      mention     |        entity       |     review    |
+-------------------------+------------------+---------------------+---------------+
| "Directed by Robert..." | "Tom Hanks"      | imdb:nm0000158      | :MovieReview  |
| "Directed by Robert..." | "Tom Hanks"      | wikipedia:Tom_Hanks | :MovieReview  |
+-------------------------+------------------+---------------------+---------------+

All extractors accept one more output variable, docs:type, which will output the type of entity (e.g., Person, Organization), when available.

prefix docs: <tag:stardog:api:docs:>

select * {
  ?review :content ?text

  service docs:entityExtractor {
    []  docs:text ?text ;
        docs:mention ?mention ;
        docs:entity ?entity ;
        docs:type ?type
  }
}

+-------------------------+------------------+----------------+-----------+---------------+
|         text            |      mention     |     entity     |    type   |     review    |
+-------------------------+------------------+----------------+-----------+---------------+
| "Directed by Robert..." | "Tom Hanks"      | imdb:nm0000158 | :Person   | :MovieReview  |
| "Directed by Robert..." | "Robert Zemeckis"| imdb:nm0000709 | :Person   | :MovieReview  |
+-------------------------+------------------+----------------+-----------+---------------+
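
The four queries above differ only in which triple patterns appear inside the service block. As a summary, the following sketch assembles the query text from a chosen mode; the class EntityExtractorQuery and its Mode enum are hypothetical helpers, and the code only builds a string, so running the query against a database is left to whichever client is in use.

```java
/** Hypothetical helper that builds the docs:entityExtractor queries shown above. */
final class EntityExtractorQuery {

    enum Mode { ENTITIES, LINKER, DICTIONARY }

    /** Builds the query text; the patterns present in the service block select the extractor. */
    static String build(Mode mode, boolean withType) {
        StringBuilder query = new StringBuilder();
        query.append("prefix docs: <tag:stardog:api:docs:>\n\n");
        query.append("select * {\n");
        query.append("  ?review :content ?text\n\n");
        query.append("  service docs:entityExtractor {\n");
        query.append("    []  docs:text ?text ;\n");
        query.append("        docs:mention ?mention");
        if (mode != Mode.ENTITIES) {
            // docs:entity switches from plain entity extraction to linking.
            query.append(" ;\n        docs:entity ?entity");
        }
        if (mode == Mode.DICTIONARY) {
            // docs:mode docs:Dictionary selects dictionary-based linking.
            query.append(" ;\n        docs:mode docs:Dictionary");
        }
        if (withType) {
            // docs:type is an optional output accepted by all three extractors.
            query.append(" ;\n        docs:type ?type");
        }
        query.append("\n  }\n}\n");
        return query.toString();
    }

    public static void main(String[] args) {
        System.out.println(build(Mode.DICTIONARY, true));
    }
}
```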

Custom Entity Extractors

In addition to the built-in extractors, Stardog also supports custom entity extractors (analogously to custom RDF extractors). The docs:entityExtractor service accepts additional triple patterns to specify and configure the required extractor:

prefix docs: <tag:stardog:api:docs:>

select * {
  ?review :content ?text

  service docs:entityExtractor {
    []  docs:text ?text ;
        docs:mention ?mention ;
        docs:extractor <urn:user:defined:extractor> ;
        docs:ud:config "configuration" 
  }
}

The docs:extractor predicate specifies the URI of a registered extractor. To register an extractor, one adds a factory class by subclassing EntityExtractorFactory. The factory specifies the extractor’s URI and the name (used in query plans) via its methods. It is then registered with the Java ServiceLoader by putting its class name in a file called com.complexible.stardog.docs.nlp.EntityExtractorFactory inside the META-INF/services directory in Stardog’s classpath. The jar with the extractor and its factory should also be in the classpath.

The custom extractor can be configured when invoked in a query. Its factory will get the SPARQL pattern specified inside the service and can read any required information before instantiating the extractor. In the example above, the docs:ud:config predicate is used to configure the extractor.
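
The factory pattern described above can be sketched with stand-in types. Everything in this example is hypothetical: the nested Factory interface is not Stardog's EntityExtractorFactory (whose actual methods should be taken from the Stardog API), the configuration is reduced to a simple map rather than a SPARQL pattern, and the toy extractor just returns words starting with a configured prefix. The sketch only shows the flow: find the factory by URI, hand it the configuration, and get back a ready-to-use extractor.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Stand-in types for illustration; Stardog's real EntityExtractorFactory has its own signature. */
final class CustomExtractorSketch {

    /** Hypothetical contract: a URI, a display name, and a way to build a configured extractor. */
    interface Factory {
        String uri();
        String name();
        Function<String, List<String>> create(Map<String, String> config);
    }

    /** Toy factory whose extractor returns every word that starts with the configured prefix. */
    static final class PrefixFactory implements Factory {
        public String uri() { return "urn:user:defined:extractor"; }
        public String name() { return "PrefixExtractor"; }
        public Function<String, List<String>> create(Map<String, String> config) {
            // The factory reads its configuration before instantiating the extractor.
            String prefix = config.getOrDefault("config", "");
            return text -> Arrays.stream(text.split("\\s+"))
                                 .filter(word -> word.startsWith(prefix))
                                 .collect(Collectors.toList());
        }
    }

    /** Looks up a factory by URI; in a deployment the candidates would come from ServiceLoader. */
    static Optional<Factory> find(String uri, Iterable<? extends Factory> factories) {
        for (Factory factory : factories) {
            if (factory.uri().equals(uri)) {
                return Optional.of(factory);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Factory factory = find("urn:user:defined:extractor", List.of(new PrefixFactory())).get();
        Function<String, List<String>> extractor = factory.create(Map.of("config", "W"));
        System.out.println(factory.name() + ": " + extractor.apply("Tom Hanks and Wilson"));
    }
}
```

With the JDK's ServiceLoader, the list passed to find would instead be discovered from the META-INF/services file named in the text above, which is why the factory's class name has to be listed there.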