Extending BITES
Learn how to extend Stardog’s NLP pipeline.
One of the most powerful features of BITES, our unstructured data ingestion system, is the ability to easily create domain-specific NLP pipelines that process and extract structured data from text. We call them Knowledge Extractors and, by default, we ship Stardog with several useful ones. For example, the tika extractor extracts metadata from all kinds of documents, such as title, authors, and creation dates; the entities extractor extracts named entity mentions, while the linker and dictionary extractors further link those entities to nodes in a knowledge graph.
In this post we will show you how we created three new Knowledge Extractors, based on Stanford’s CoreNLP, which we just released as an open source project.
Entity Extraction and Linking
Our entities and linker extractors are based on OpenNLP. Although they work very well out of the box in most domains, the underlying models sometimes struggle to identify certain named entities. Stanford’s CoreNLP offers a very powerful set of named entity recognition models, which are known to provide state-of-the-art results on several industry datasets.
Expanding our entity recognition and linking modules to use CoreNLP was easy. Internally, those modules work as a pipeline, and the only coupling to OpenNLP was at the first step, i.e., parsing the text and translating it to our internal Document representation. This representation is structured much like CoreNLP’s own. By replacing that first step with a CoreNLP-based one, we created two extractors: CoreNLPMentionRDFExtractor, which replicates the behavior of the entities extractor, and CoreNLPEntityLinkerRDFExtractor, which does the same for the linker extractor.
Relation Extraction
One of the most interesting features of CoreNLP is the ability to extract not only named entities, but also the relationships between them. For example, given the sentence:
The Orioles are a professional baseball team based in Baltimore.
we can identify that Orioles and Baltimore are named entities, but we can also identify an implicit relationship between them: Baltimore is the city where the Orioles are headquartered.
We created an extractor, CoreNLPRelationRDFExtractor, that leverages this feature to automatically extract nodes and edges from text. For example, running the previous sentence through the extractor generates the following output:
entity:f06574 rdfs:label "Orioles"
entity:679a56 rdfs:label "Baltimore"
entity:f06574 relation:org:city_of_headquarters entity:679a56
CoreNLP provides models to extract several different kinds of relationships, such as lives_in and works_for, and it’s also possible to train your own models to recognize relationships specific to the data that you care about.
Usage
We released all three extractors as an open source project, bites-corenlp, available on GitHub. Using them with Stardog is easy:
- Download the jar
- Add that jar to Stardog’s classpath, by copying it to the server/ext folder inside Stardog or by pointing the environment variable STARDOG_EXT to the directory containing the jar
- Restart the Stardog server
CoreNLPMentionRDFExtractor, CoreNLPEntityLinkerRDFExtractor, and CoreNLPRelationRDFExtractor will then be available as RDF extractors, accessible through the CLI, API, and HTTP interfaces.
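The installation steps above can be sketched as a small shell helper. This is only an illustration under stated assumptions: the jar file name (bites-corenlp.jar) and the install directory (/opt/stardog) are hypothetical placeholders, and the restart commands assume the server is managed with stardog-admin.

```shell
# Copy the extractor jar into an extension directory.
#   $1: path to the downloaded jar (file name is an assumption)
#   $2: target directory, e.g. the server/ext folder inside Stardog
install_extractor_jar() {
    jar_path=$1
    ext_dir=$2
    if [ ! -f "$jar_path" ]; then
        echo "jar not found: $jar_path" >&2
        return 1
    fi
    # Create the directory if needed, then copy the jar into it.
    mkdir -p "$ext_dir" && cp "$jar_path" "$ext_dir/"
}

# Example usage (paths are hypothetical; adjust to your installation):
#   install_extractor_jar ./bites-corenlp.jar /opt/stardog/server/ext
#
# Alternative: leave the jar where it is and point STARDOG_EXT at its directory:
#   export STARDOG_EXT=/path/to/directory/containing/the/jar
#
# Either way, restart the server so the new extractors are picked up:
#   stardog-admin server stop
#   stardog-admin server start
```

Only one of the two options is needed. Copying into server/ext keeps everything inside the Stardog installation, while STARDOG_EXT lets you keep extensions in a separate directory.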
For example, to add a document to BITES and extract its entities using the CLI:
stardog doc put --rdf-extractors CoreNLPMentionRDFExtractor myDatabase document.pdf
Multiple extractors can be applied to the same document. For example, the following command will extract metadata and relationships from a document:
stardog doc put --rdf-extractors tika,CoreNLPRelationRDFExtractor myDatabase document.pdf
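When several documents need the same set of extractors, the command above can be wrapped in a loop. The sketch below is one possible way to do that; the helper names join_extractors and put_all are hypothetical, and the only Stardog command used is the stardog doc put invocation shown above.

```shell
# Join extractor names into the comma-separated form passed to --rdf-extractors.
join_extractors() {
    # "$*" joins the arguments with the first character of IFS;
    # the subshell keeps the IFS change local to this call.
    (IFS=,; printf '%s\n' "$*")
}

# Add each document to a database, applying the same extractors to all of them.
#   $1: database name
#   $2: comma-separated extractor list
#   remaining arguments: document paths
put_all() {
    db=$1
    extractors=$2
    shift 2
    for doc in "$@"; do
        # Stop at the first document that fails to load.
        stardog doc put --rdf-extractors "$extractors" "$db" "$doc" || return 1
    done
}

# Example usage (file names are placeholders):
#   put_all myDatabase "$(join_extractors tika CoreNLPRelationRDFExtractor)" report.pdf notes.pdf
```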