Unstructured Content
This chapter discusses features for unifying unstructured content like PDF documents in Stardog.
Overview
Unifying unstructured data is, by necessity, a different process from unifying structured or semistructured data. Stardog includes a document storage subsystem called BITES (Blob Indexing and Text Enrichment with Semantics), which provides configurable storage and processing for unifying unstructured data with the Stardog graph. The following figure shows the main BITES components:
Document Storage
BITES allows storage and retrieval of documents in the form of files. Stardog treats documents as opaque blobs of data; it defers to the extraction process to make sense of individual documents. Document storage is independent of file and data formats.
Stardog internally stores documents as files. The location of these files defaults to a subdirectory of STARDOG_HOME, but this can be overridden. Documents can be stored on a local filesystem, or an abstraction thereof, accessible from the Stardog server, or on Amazon S3 by setting the docs.filesystem.uri database configuration option.
The exact location is given by the docs.path database configuration option.
Structured Data Extraction
BITES supports an optional processing stage in which a document is processed to extract an RDF graph to add to the database. BITES has the following built-in RDF extractors:
RDF Extractor | Description |
---|---|
tika | This extractor is based on Apache Tika, collects metadata about the document, and asserts its extracted set of RDF statements to a named graph specific to the document. |
text | Adds an RDF statement with the full text extracted from the document. A side effect of this extractor is that a document’s text will be indexed by the search index twice: once for the document itself, and again for the value of this RDF statement. |
entities | This extractor uses OpenNLP to extract all the mentions of named entities from the document and adds this information to the document named graph. |
linker | This extractor works just like entities, but after it finds a named entity mention in the document, it also finds the entity in the database that best matches that mention. |
dictionary | Similar to linker, but using a user-provided dictionary that maps named entity mentions to IRIs. |
The CoreNLPEntityLinkerRDFExtractor, CoreNLPMentionRDFExtractor, and CoreNLPRelationRDFExtractor extractors are available through the bites-corenlp repository. Instructions for their setup are provided in the README of the repository.
See Entity Extraction and Linking for more details about some of these extractors.
Custom RDF Extractors
The included RDF extractors are intentionally basic, especially when compared to machine learning or text mining algorithms. A custom extractor connects the document store to algorithms tailored specifically to your data. The extractor SPI allows integration of any workflow or algorithm, from NLP methods like part-of-speech tagging, entity recognition, relationship learning, or sentiment analysis to machine learning models such as document ranking and clustering.
Extracted RDF assertions are stored in a named graph specific to the document, allowing provenance tracking and versatile querying. The extractor must implement the RDFExtractor interface. A convenience class, TextProvidingRDFExtractor, extracts the text from the document before calling the extractor. In addition, AbstractEntityRDFExtractor, or one of its existing subclasses, extends TextProvidingRDFExtractor so you can customize entity linking extraction to your specific needs.
The text extractor SPI gives you the opportunity to support arbitrary document formats. Implementations will be given a raw document and be expected to extract a string of text, which will be added to the full-text search index. Text extractors should implement the TextProvidingRDFExtractor interface.
Custom extractors are registered with the Java ServiceLoader under the RDFExtractor or TextProvidingRDFExtractor class names. Custom extractors can be referred to from the command line or APIs by their fully qualified or “simple” class names.
For an example of a custom RDF extractor, see our GitHub repository. Note that custom RDF extractors can be built on top of built-in or custom entity extractors.
Text Extraction
The document store is fully integrated with Stardog’s Full Text Search. As with RDF extraction, text extraction supports arbitrary file formats, and pluggable extractors are able to retrieve the textual contents of a document for indexing. Once a document is added to BITES, its contents can be searched in the same way as other literals by using the standard textMatch predicate in SPARQL queries.
Managing Documents
CRUD operations on documents can be performed from the command line, Java API, or HTTP API. Please refer to the StardocsConnection API for details of using the document store from Java.
The following is an example session showing how to manage documents from the command line:
We have a document stored in the file whyfp90.pdf, which we will add to the document store.
$ ls -al whyfp90.pdf
-rw-r--r-- 1 user user 200007 Aug 30 09:46 whyfp90.pdf
We add it to the document store and receive the document’s IRI as a return value.
$ stardog doc put myDB whyfp90.pdf
Successfully put document in the document store: tag:stardog:api:docs:myDB:whyfp90.pdf
Adding the same document again will delete all previous extraction results and insert new ones. With the --keep-assertions option, previous assertions are kept and new ones are appended.
$ stardog doc put myDB --keep-assertions -r text whyfp90.pdf
Successfully put document in the document store: tag:stardog:api:docs:myDB:whyfp90.pdf
Alternatively, we can add it with a different name. Repeated calls will update the document and refresh extraction results.
$ stardog doc put myDB --name why-functional-programming-matters.pdf whyfp90.pdf
Successfully put document in the document store: tag:stardog:api:docs:myDB:why-functional-programming-matters.pdf
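When scripting these calls, it can help to assemble the argument list programmatically and to pull the document IRI out of the command's output. The following is a minimal sketch in Python, not part of Stardog: the helper names build_doc_put_command and parse_put_output are hypothetical, the only options used are the ones shown in the session above (--name, --keep-assertions, -r), and the success message is assumed to have exactly the form printed above, which may differ between Stardog versions.

```python
from typing import List, Optional

# Assumption: the success line has the form shown in the example session.
PUT_SUCCESS_PREFIX = "Successfully put document in the document store: "


def build_doc_put_command(
    database: str,
    file_path: str,
    name: Optional[str] = None,
    rdf_extractor: Optional[str] = None,
    keep_assertions: bool = False,
) -> List[str]:
    """Build the argument list for `stardog doc put` (hypothetical helper).

    Options are placed between the database name and the file path,
    mirroring the order used in the example session.
    """
    command = ["stardog", "doc", "put", database]
    if keep_assertions:
        command.append("--keep-assertions")
    if rdf_extractor is not None:
        command += ["-r", rdf_extractor]
    if name is not None:
        command += ["--name", name]
    command.append(file_path)
    return command


def parse_put_output(output: str) -> Optional[str]:
    """Return the document IRI from the CLI output, or None if not found."""
    for line in output.splitlines():
        if line.startswith(PUT_SUCCESS_PREFIX):
            return line[len(PUT_SUCCESS_PREFIX):].strip()
    return None
```

The returned list can be passed to subprocess.run(command, capture_output=True, text=True, check=True), and the captured standard output handed to parse_put_output to recover the IRI for later SPARQL queries.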
We can subsequently retrieve documents and store them locally.
$ stardog doc get myDB whyfp90.pdf
Wrote document 'whyfp90.pdf' to file 'whyfp90.pdf'
Local files will not be overwritten.
$ stardog doc get myDB whyfp90.pdf
File 'whyfp90.pdf' already exists. You must remove it or specify a different filename.
How many documents are in the document store?
$ stardog doc count myDB
Count: 2 documents
Removing a document will also clear its named graph and full-text search index entries.
$ stardog doc delete myDB whyfp90.pdf
Successfully executed deletion.
Re-indexing the document store allows us to apply a different RDF or text extractor to all the documents, refreshing extraction results.
$ stardog doc reindex myDB -r entities
"Re-indexed 1 documents"
See the doc command group page for more details about the doc CLI commands.
Named Graphs and Document Queries
Documents in BITES are identified by IRI. As shown in the command line examples above, the IRI is returned from a document put call. The IRI is a combination of a prefix, the database name, and the document name. The CLI uses the document name to refer to the documents. The RDF index, and therefore SPARQL queries, use the IRIs to refer to the documents. RDF assertions extracted from a document are placed into a named graph identified by the document’s IRI.
Here we can see the results of querying a document’s named graph when using the default metadata extractor:
$ stardog query execute myDB "select ?p ?o { graph <tag:stardog:api:docs:myDB:whyfp90.pdf> { ?s ?p ?o } }"
+--------------------------------------------+--------------------------------------+
| p                                          | o                                    |
+--------------------------------------------+--------------------------------------+
| rdf:type                                   | http://xmlns.com/foaf/0.1/Document   |
| rdf:type                                   | tag:stardog:api:docs:Document        |
| tag:stardog:api:docs:fileSize              | 200007                               |
| http://purl.org/dc/elements/1.1/identifier | "whyfp90.pdf"                        |
| rdfs:label                                 | "whyfp90.pdf"                        |
| http://ns.adobe.com/pdf/1.3/PDFVersion     | "1.3"                                |
| http://ns.adobe.com/xap/1.0/CreatorTool    | "TeX"                                |
| http://ns.adobe.com/xap/1.0/t/pg/NPages    | 23                                   |
| http://purl.org/dc/terms/created           | "2006-05-19T13:42:00Z"^^xsd:dateTime |
| http://purl.org/dc/elements/1.1/format     | "application/pdf; version=1.3"       |
| http://ns.adobe.com/pdf/1.3/encrypted      | "false"                              |
+--------------------------------------------+--------------------------------------+
Query returned 11 results in 00:00:00.045
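The IRI scheme described in this section (a prefix, the database name, and the document name) can be reproduced in a script to build named graph queries without first calling put. The sketch below is illustrative only: the function names are hypothetical, the prefix tag:stardog:api:docs: is inferred from the IRIs printed in the examples above, and document names containing characters that are not valid in an IRI are not handled, since how Stardog encodes them is not covered here.

```python
# Assumption: prefix inferred from the IRIs shown in the CLI examples.
DOC_IRI_PREFIX = "tag:stardog:api:docs:"


def document_iri(database: str, document_name: str) -> str:
    """Combine the prefix, database name, and document name into an IRI."""
    return f"{DOC_IRI_PREFIX}{database}:{document_name}"


def named_graph_query(database: str, document_name: str) -> str:
    """Build a SPARQL query over the named graph of a single document."""
    iri = document_iri(database, document_name)
    # Doubled braces are literal braces inside an f-string.
    return f"select ?p ?o {{ graph <{iri}> {{ ?s ?p ?o }} }}"
```

Calling named_graph_query("myDB", "whyfp90.pdf") yields the same query string that is passed to stardog query execute in the example above.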