Learn how to extend Stardog’s NLP pipeline.
One of the most powerful features of BITES, our unstructured data ingestion system, is the ability to easily create domain-specific NLP pipelines that process and extract structured data from text. We call them Knowledge Extractors and, by default, we ship Stardog with several useful ones. For example, tika extracts metadata from all kinds of documents, such as title, authors, and creation dates; entities extracts named entity mentions; and linker and dictionary further link those entities to nodes in a knowledge graph.
In this post we will show you how we created three new Knowledge Extractors, based on Stanford’s CoreNLP, which we just released as an open source project.
Entity Extraction and Linking
The entities and linker extractors are based on OpenNLP. Although they work very well out of the box in most domains, the underlying models sometimes struggle to identify certain named entities. Stanford’s CoreNLP offers a very powerful set of named entity recognition models, which are known to provide state-of-the-art results on several industry datasets.
Expanding our entity recognition and linking modules to use CoreNLP was easy. Internally, those modules work as a pipeline, and the only coupling to OpenNLP was at the first step, i.e., parsing the text and translating it to our internal Document representation. This representation closely mirrors CoreNLP’s own, so all we needed was a transformation from CoreNLP’s output to our Document at the start of the pipeline. With that in place, we created two extractors:
- CoreNLPMentionRDFExtractor, which replicates the behavior of entities
- CoreNLPEntityLinkerRDFExtractor, which does the same but for linker
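To make the design concrete, here is a minimal sketch of a pipeline whose only toolkit-specific piece is its first step. It is not Stardog's actual code: `Token`, `Document`, `make_gazetteer_parser`, `extract_mentions`, and `run_pipeline` are hypothetical names, and the toy parser is a stand-in for a real NLP toolkit such as OpenNLP or CoreNLP.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Token:
    text: str
    entity_type: str = "O"  # "O" marks a token outside any entity


@dataclass
class Document:
    # Hypothetical toolkit-neutral representation: sentences of tokens.
    sentences: List[List[Token]] = field(default_factory=list)


# A parser is the only step that knows about a specific NLP toolkit.
Parser = Callable[[str], Document]


def make_gazetteer_parser(names: Set[str]) -> Parser:
    """Build a toy parser that tags any token found in `names` as an entity."""
    def parse(text: str) -> Document:
        doc = Document()
        for raw in text.split("."):
            words = raw.split()
            if words:
                doc.sentences.append(
                    [Token(w, "ENTITY" if w in names else "O") for w in words]
                )
        return doc
    return parse


def extract_mentions(doc: Document) -> List[str]:
    """Downstream step: depends only on Document, never on the parser."""
    return [t.text for s in doc.sentences for t in s if t.entity_type != "O"]


def run_pipeline(text: str, parser: Parser) -> List[str]:
    return extract_mentions(parser(text))


SENTENCE = "The Orioles are a professional baseball team based in Baltimore."
weak_parser = make_gazetteer_parser({"Orioles"})
strong_parser = make_gazetteer_parser({"Orioles", "Baltimore"})

print(run_pipeline(SENTENCE, weak_parser))
print(run_pipeline(SENTENCE, strong_parser))
```

Swapping `weak_parser` for `strong_parser` changes which mentions are found, while `extract_mentions` stays untouched. That is the same property that let us plug CoreNLP in at the first step.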
One of the most interesting features of CoreNLP is its ability to extract not only named entities, but also the relationships between them. For example, given the sentence
The Orioles are a professional baseball team based in Baltimore.
we can identify that Orioles and Baltimore are named entities, but we can also identify an implicit relationship between them: Baltimore is the headquarters city of the Orioles.
We created an extractor, CoreNLPRelationRDFExtractor, that leverages this feature to automatically extract nodes and edges from text. For example, running the previous sentence through the extractor generates the following output:
entity:f06574 rdfs:label "Orioles"
entity:679a56 rdfs:label "Baltimore"
entity:f06574 relation:org:city_of_headquarters entity:679a56
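The shape of that output (one label per entity, plus one edge between them) can be sketched in a few lines. This is an illustration only: `entity_id` and `relation_to_triples` are hypothetical helpers, and the hash-based identifier is an assumption, so the IDs it produces will not match the ones shown above.

```python
import hashlib
from typing import List, Tuple

Triple = Tuple[str, str, str]


def entity_id(label: str) -> str:
    # Assumption: a short hash of the label serves as the node identifier.
    # The extractor's real ID scheme is not described in this post.
    return "entity:" + hashlib.sha1(label.encode("utf-8")).hexdigest()[:6]


def relation_to_triples(subject: str, relation: str, obj: str) -> List[Triple]:
    """Turn one extracted relation into two label triples and one edge triple."""
    s, o = entity_id(subject), entity_id(obj)
    return [
        (s, "rdfs:label", f'"{subject}"'),
        (o, "rdfs:label", f'"{obj}"'),
        (s, f"relation:{relation}", o),
    ]


triples = relation_to_triples("Orioles", "org:city_of_headquarters", "Baltimore")
for triple in triples:
    print(" ".join(triple))
```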
CoreNLP provides models to extract several different kinds of relationships, such as works_for, and it’s also possible to train your own models to recognize relationships specific to data that you care about.
We released all three extractors as an open source project, bites-corenlp, available on GitHub. Using them with Stardog is easy:
- Download the jar
- Add that jar to Stardog’s classpath, by copying it to the server/ext folder inside Stardog or by pointing the environment variable STARDOG_EXT to the directory containing the jar
- Restart the Stardog server
Once the server restarts, CoreNLPMentionRDFExtractor, CoreNLPEntityLinkerRDFExtractor, and CoreNLPRelationRDFExtractor will be available as RDF extractors, accessible through the CLI, API, and HTTP interfaces.
For example, to add a document to BITES and extract its entities using the CLI:
stardog doc put --rdf-extractors CoreNLPMentionRDFExtractor myDatabase document.pdf
Multiple extractors can be applied to the same document. For example, the following command will extract metadata and relationships from a document:
stardog doc put --rdf-extractors tika,CoreNLPRelationRDFExtractor myDatabase document.pdf
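If you drive the CLI from a script, the same commands can be assembled programmatically. The sketch below assumes the `stardog` binary is on your PATH; `build_doc_put_command` and `put_document` are hypothetical helper names, and the only CLI syntax used is the one shown in the commands above.

```python
import subprocess
from typing import List, Sequence


def build_doc_put_command(
    database: str, document: str, extractors: Sequence[str]
) -> List[str]:
    """Assemble `stardog doc put` with a comma-separated extractor list."""
    if not extractors:
        raise ValueError("at least one RDF extractor is required")
    return [
        "stardog", "doc", "put",
        "--rdf-extractors", ",".join(extractors),
        database, document,
    ]


def put_document(database: str, document: str, extractors: Sequence[str]) -> None:
    # Runs the CLI; raises CalledProcessError if stardog exits with an error.
    subprocess.run(build_doc_put_command(database, document, extractors), check=True)


command = build_doc_put_command(
    "myDatabase", "document.pdf", ["tika", "CoreNLPRelationRDFExtractor"]
)
print(" ".join(command))
```

Passing the arguments as a list, rather than one shell string, avoids quoting problems when a document path contains spaces.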