Entity Extraction and Linking

This page discusses entity extraction and linking in Stardog using BITES.

Page Contents
  1. Using Entity Extractors
    1. Configuration
    2. OpenNLP
    3. Entities
    4. Linker
    5. Dictionary
  2. SPARQL
  3. Custom Extractors
  4. Recommended Reading:

Using Entity Extractors

Configuration

Stardog uses the tika RDF extractor by default. To use a different built-in extractor, such as text, entities, linker, or dictionary, set the database configuration option docs.default.rdf.extractors. The Unstructured Content section provides a table describing these extractors.

If you have manually added the CoreNLP extractors discussed below, then CoreNLPMentionRDFExtractor, CoreNLPEntityLinkerRDFExtractor, and CoreNLPRelationRDFExtractor are also available as values for this database configuration option.

OpenNLP

By default, BITES uses the tika RDF extractor, which only extracts metadata from documents. Stardog can be configured to use the OpenNLP library to detect named entities mentioned in documents and, optionally, to link those mentions to existing resources in the database.

Stardog can also be configured to use Stanford’s CoreNLP library for entity extraction, linking, and relationship extraction. More information about their configuration is available in the bites-corenlp repository.

The first step in using entity extractors is to identify the set of OpenNLP models that will be used. The following models are always required:

  • A tokenizer and sentence detector. OpenNLP provides models for several languages (e.g., en-token.bin and en-sent.bin)
  • At least one name finder model. Stardog supports both dictionary-based and custom trained models. OpenNLP provides models for several types of entities and languages (e.g., en-ner-person.bin). We provide our own name finder models created from Wikipedia and DBPedia, which provide high recall / low precision in identifying Person, Organization, and Location types from English language documents.

Put all of these files in the same directory, and set the database configuration option docs.opennlp.models.path to that directory's location, either during or after database creation.

For example, suppose you have a folder /data/stardog/opennlp with files en-token.bin, en-sent.bin, and en-ner-person.bin. The database creation CLI command would be as follows:

$ stardog-admin db create -o docs.opennlp.models.path=/data/stardog/opennlp -n movies

For consistency, model filenames should follow specific patterns:

  • *-token.bin for tokenizers (e.g., en-token.bin)
  • *-sent.bin for sentence detectors (e.g., en-sent.bin)
  • *-ner-*.dict for dictionary-based name finders (e.g., dbpedia-en-ner-person.dict)
  • *-ner-*.bin for custom trained name finders (e.g., wikipedia-en-ner-organization.bin)
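The naming patterns above can be checked before copying files into the models directory. The following is a minimal sketch that classifies a filename by those patterns; the class ModelFileNames and its Kind enum are hypothetical names introduced for illustration and are not part of Stardog or OpenNLP.

```java
import java.util.Optional;

// Illustrative helper only: mirrors the filename patterns listed above.
// It is not Stardog's own model loader.
final class ModelFileNames {

    enum Kind { TOKENIZER, SENTENCE_DETECTOR, DICTIONARY_NAME_FINDER, TRAINED_NAME_FINDER }

    // Returns the kind of model a filename suggests, or empty if no pattern matches.
    static Optional<Kind> classify(String fileName) {
        if (fileName.endsWith("-token.bin")) {
            return Optional.of(Kind.TOKENIZER);              // *-token.bin
        }
        if (fileName.endsWith("-sent.bin")) {
            return Optional.of(Kind.SENTENCE_DETECTOR);      // *-sent.bin
        }
        if (fileName.matches(".*-ner-.*\\.dict")) {
            return Optional.of(Kind.DICTIONARY_NAME_FINDER); // *-ner-*.dict
        }
        if (fileName.matches(".*-ner-.*\\.bin")) {
            return Optional.of(Kind.TRAINED_NAME_FINDER);    // *-ner-*.bin
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String[] names = {
            "en-token.bin", "en-sent.bin",
            "dbpedia-en-ner-person.dict", "wikipedia-en-ner-organization.bin",
            "notes.txt"
        };
        for (String name : names) {
            System.out.println(name + " -> "
                + classify(name).map(Kind::name).orElse("unrecognized"));
        }
    }
}
```

A check like this only validates names; it says nothing about whether the file contents are a valid OpenNLP model.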

Entities

The entities extractor detects mentions of named entities based on the configured models and creates RDF statements for those entities. To use a non-default extractor, specify it when adding a document. The tika metadata extractor and the entities extractor can be used at the same time:

Using the doc put CLI command:

$ stardog doc put --rdf-extractors tika,entities movies CastAwayReview.pdf

The result of entity extraction will be in a named graph where an auto-generated IRI is used for the entity:

<tag:stardog:api:docs:movies:CastAwayReview.pdf> {
    <tag:stardog:api:docs:entity:9ad311b4-ddf8-4da2-a49f-3fa8f79813c2> rdfs:label "Wilson" .
    <tag:stardog:api:docs:entity:0d25b4ed-9cd4-4e00-ac3d-f984012b67f5> rdfs:label "Tom Hanks" .
    <tag:stardog:api:docs:entity:e559b828-714f-407d-aa73-7bdc39ee8014> rdfs:label "Robert Zemeckis" .
}

Linker

The linker extractor performs the same task as entities but after the entities are extracted it links those entities to the existing resources in the database. Linking is done by matching the mention text with the identifier and labels of existing resources in the database. This extractor requires full text search to be enabled to find the matching candidates and uses string similarity metrics to choose the best match. The commonly used properties for labels are supported: rdfs:label, foaf:name, dc:title, skos:prefLabel and skos:altLabel.

Using the doc put CLI command:

$ stardog doc put --rdf-extractors linker movies CastAwayReview.pdf

The extraction results of linker are similar to those of entities, but contain only the entities for which a link to an existing resource was found. The link is available through the dc:references property.

<tag:stardog:api:docs:movies:CastAwayReview.pdf> {

    <tag:stardog:api:docs:entity:0d25b4ed-9cd4-4e00-ac3d-f984012b67f5> rdfs:label "Tom Hanks" ;
        <http://purl.org/dc/terms/references> <http://www.imdb.com/name/nm0000158> .

    <tag:stardog:api:docs:entity:e559b828-714f-407d-aa73-7bdc39ee8014> rdfs:label "Robert Zemeckis" ;
        <http://purl.org/dc/terms/references> <http://www.imdb.com/name/nm0000709> .
}
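To make the matching step more concrete, here is a minimal sketch of choosing the best candidate for a mention by string similarity. It rests on assumptions: this page does not say which similarity metrics Stardog uses, so normalized Levenshtein distance is swapped in purely for illustration, and the class LinkerSketch, its methods, and the threshold parameter are hypothetical. In Stardog, the candidates would come from the full-text search index rather than an in-memory map.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Illustrative only: Levenshtein similarity is an assumed stand-in for
// whatever string similarity metrics the linker extractor actually applies.
final class LinkerSketch {

    // Edit distance computed with two rolling rows of the DP table.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Case-insensitive similarity in [0, 1]; 1.0 means identical strings.
    static double similarity(String a, String b) {
        String x = a.toLowerCase(Locale.ROOT);
        String y = b.toLowerCase(Locale.ROOT);
        int max = Math.max(x.length(), y.length());
        return max == 0 ? 1.0 : 1.0 - (double) levenshtein(x, y) / max;
    }

    // Picks the candidate IRI whose label is most similar to the mention,
    // or empty if no candidate reaches the threshold.
    static Optional<String> bestMatch(String mention, Map<String, String> labelsByIri, double threshold) {
        String bestIri = null;
        double bestScore = -1.0;
        for (Map.Entry<String, String> candidate : labelsByIri.entrySet()) {
            double score = similarity(mention, candidate.getValue());
            if (score > bestScore) {
                bestScore = score;
                bestIri = candidate.getKey();
            }
        }
        return bestScore >= threshold ? Optional.ofNullable(bestIri) : Optional.empty();
    }
}
```

With a threshold in place, a mention such as "Wilson" that resembles no existing label yields no link, which is consistent with the linker output above containing only linked entities.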

Dictionary

The dictionary extractor fulfills the same purpose as the linker, but instead of heuristically matching a mention's text against existing resources, it uses a user-defined dictionary to perform that task. The dictionary provides a set of mappings between text and IRIs. Each mention found in the document is looked up in the dictionary and, if found, its IRIs are added as dc:references links.

Dictionaries are .linker files, which need to be available in the folder determined by the database configuration property docs.opennlp.models.path. Stardog provides several dictionaries created from Wikipedia and DBPedia, which allow users to automatically link entity mentions to IRIs in those knowledge bases.

$ stardog doc put --rdf-extractors dictionary movies CastAwayReview.pdf

When using the dictionary option, all .linker files in the docs.opennlp.models.path folder will be used. The output follows the same syntax as the linker.

<tag:stardog:api:docs:movies:CastAwayReview.pdf> {

    <tag:stardog:api:docs:entity:0d25b4ed-9cd4-4e00-ac3d-f984012b67f5> rdfs:label "Tom Hanks" ;
        <http://purl.org/dc/terms/references> <http://en.wikipedia.org/wiki/Tom_Hanks> ;
        <http://purl.org/dc/terms/references> <http://dbpedia.org/resource/Tom_Hanks> .
}
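Conceptually, the dictionary step is a lookup from mention text to one or more IRIs. The sketch below illustrates that idea with plain Java collections. It assumes exact, case-sensitive matching, which this page does not confirm, and the class DictionaryLookupSketch is a hypothetical name that does not use or represent the .linker file format.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: an in-memory stand-in for a text-to-IRI dictionary.
// Exact string matching is an assumption made for this sketch.
final class DictionaryLookupSketch {

    // For each mention present in the dictionary, returns the IRIs that
    // would be attached as dc:references links. Unknown mentions are skipped.
    static Map<String, List<String>> link(List<String> mentions,
                                          Map<String, List<String>> dictionary) {
        Map<String, List<String>> links = new LinkedHashMap<>();
        for (String mention : mentions) {
            List<String> iris = dictionary.get(mention);
            if (iris != null && !iris.isEmpty()) {
                links.put(mention, iris);
            }
        }
        return links;
    }
}
```

Because one mention can map to several IRIs, a single entity can carry multiple dc:references values, as in the Tom Hanks output above.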

User-defined dictionaries can be created programmatically. For example, the Java class below will create a dictionary that links every mention of Tom Hanks to two IRIs.

import java.io.File;
import java.io.IOException;

import com.complexible.stardog.docs.nlp.impl.DictionaryLinker;
import com.google.common.collect.ImmutableMultimap;
import com.stardog.stark.model.IRI;

import static com.stardog.stark.Values.iri;

public class CreateLinker {

    public static void main(String[] args) throws IOException {
        // Map the mention text "Tom Hanks" to the two IRIs it should link to
        ImmutableMultimap<String, IRI> aDictionary = ImmutableMultimap.<String, IRI>builder()
                                                         .putAll("Tom Hanks", iri("https://en.wikipedia.org/wiki/Tom_Hanks"), iri("http://www.imdb.com/name/nm0000158"))
                                                         .build();

        DictionaryLinker.Linker aLinker = new DictionaryLinker.Linker(aDictionary);

        // Write the dictionary as a .linker file into the docs.opennlp.models.path folder
        aLinker.to(new File("/data/stardog/opennlp/TomHanks.linker"));
    }
}

SPARQL

The entities, linker, and dictionary extractors are also available as a SPARQL service, which makes them applicable to any data in the graph, whether stored directly in Stardog or accessed remotely on SPARQL endpoints or virtual graphs.

prefix docs: <tag:stardog:api:docs:>

select * {
  ?review :content ?text

  service docs:entityExtractor {
    []  docs:text ?text ;
        docs:mention ?mention
  }
}

The entities extractor is accessed by using the docs:entityExtractor service, which receives one input argument, docs:text, with the text to be analyzed. The output will be the extracted named entity mentions, bound to the variable given in the docs:mention property.

+-----------------------------------------------------------------------------------+------------------+---------------+
|                                       text                                        |      mention     |     review    |
+-----------------------------------------------------------------------------------+------------------+---------------+
| "Directed by Robert Zemeckis, featuring Tom Hanks and a volleyball called Wilson" | "Robert Zemeckis"| :MovieReview  |
| "Directed by Robert Zemeckis, featuring Tom Hanks and a volleyball called Wilson" | "Tom Hanks"      | :MovieReview  |
| "Directed by Robert Zemeckis, featuring Tom Hanks and a volleyball called Wilson" | "Wilson"         | :MovieReview  |
+-----------------------------------------------------------------------------------+------------------+---------------+

By adding an extra output variable, docs:entity, the linker extractor will be used instead.

prefix docs: <tag:stardog:api:docs:>

select * {
  ?review :content ?text

  service docs:entityExtractor {
    []  docs:text ?text ;
        docs:mention ?mention ;
        docs:entity ?entity
  }
}

+-------------------------+------------------+----------------+---------------+
|         text            |      mention     |     entity     |     review    |
+-------------------------+------------------+----------------+---------------+
| "Directed by Robert..." | "Tom Hanks"      | imdb:nm0000158 | :MovieReview  |
| "Directed by Robert..." | "Robert Zemeckis"| imdb:nm0000709 | :MovieReview  |
+-------------------------+------------------+----------------+---------------+

The dictionary extractor is called in a similar way to linker, with an extra argument docs:mode set to docs:Dictionary.

prefix docs: <tag:stardog:api:docs:>

select * {
  ?review :content ?text

  service docs:entityExtractor {
    []  docs:text ?text ;
        docs:mention ?mention ;
        docs:entity ?entity ;
        docs:mode docs:Dictionary
  }
}

+-------------------------+------------------+---------------------+---------------+
|         text            |      mention     |        entity       |     review    |
+-------------------------+------------------+---------------------+---------------+
| "Directed by Robert..." | "Tom Hanks"      | imdb:nm0000158      | :MovieReview  |
| "Directed by Robert..." | "Tom Hanks"      | wikipedia:Tom_Hanks | :MovieReview  |
+-------------------------+------------------+---------------------+---------------+

All extractors accept one more output variable, docs:type, which will output the type of entity (e.g., Person, Organization), when available.

prefix docs: <tag:stardog:api:docs:>

select * {
  ?review :content ?text

  service docs:entityExtractor {
    []  docs:text ?text ;
        docs:mention ?mention ;
        docs:entity ?entity ;
        docs:type ?type
  }
}

+-------------------------+------------------+----------------+-----------+---------------+
|         text            |      mention     |     entity     |    type   |     review    |
+-------------------------+------------------+----------------+-----------+---------------+
| "Directed by Robert..." | "Tom Hanks"      | imdb:nm0000158 | :Person   | :MovieReview  |
| "Directed by Robert..." | "Robert Zemeckis"| imdb:nm0000709 | :Person   | :MovieReview  |
+-------------------------+------------------+----------------+-----------+---------------+

Custom Extractors

The included extractors are intentionally basic, especially when compared to machine learning or text mining algorithms. A custom extractor connects the document store to algorithms tailored specifically to your data. The extractor SPI allows integration of any workflow or algorithm, from NLP methods like part-of-speech tagging, entity recognition, relationship learning, or sentiment analysis, to machine learning models such as document ranking and clustering.

Extracted RDF assertions are stored in a named graph specific to the document, allowing provenance tracking and versatile querying. The extractor must implement the RDFExtractor interface. A convenience class, TextProvidingRDFExtractor, extracts the text from the document before calling the extractor. In addition, AbstractEntityRDFExtractor (or one of its existing subclasses) extends TextProvidingRDFExtractor so that you can customize entity extraction and linking to your specific needs.

The text extractor SPI gives you the opportunity to support arbitrary document formats. Implementations will be given a raw document and be expected to extract a string of text which will be added to the full-text search index. Text extractors should implement the TextProvidingRDFExtractor interface.

Custom extractors are registered with the Java ServiceLoader under the RDFExtractor or TextProvidingRDFExtractor class names. Custom extractors can be referred to from the command line or APIs by their fully qualified or “simple” class names.

For an example of a custom extractor, see our GitHub repository.

Recommended Reading:

The following blog posts can aid comprehension around entity linking and extraction in Stardog: