Unstructured Content

This chapter discusses features for unifying unstructured content like PDF documents in Stardog.

Page Contents
  1. Overview
  2. Document Storage
  3. Structured Data Extraction
  4. Custom RDF Extractors
  5. Text Extraction
  6. Managing Documents
  7. Named Graphs and Document Queries

Overview

Unifying unstructured data is, by necessity, a different process from unifying structured or semistructured data. Stardog includes a document storage subsystem called BITES (Blob Indexing and Text Enrichment with Semantics), which provides configurable storage and processing for unifying unstructured data with the Stardog graph. The following figure shows the main BITES components:

[Figure: BITES pipeline]

Document Storage

BITES allows storage and retrieval of documents in the form of files. Stardog treats documents as opaque blobs of data; it defers to the extraction process to make sense of individual documents. Document storage is independent of file and data formats.

Stardog internally stores documents as files. The location of these files defaults to a subdirectory of STARDOG_HOME, but this can be overridden. Documents can be stored on a local filesystem, or an abstraction thereof, accessible from the Stardog server, or on Amazon S3 by setting the docs.filesystem.uri database configuration option.

The exact location is given by the docs.path database configuration option.

Structured Data Extraction

BITES supports an optional processing stage in which a document is processed to extract an RDF graph to add to the database. BITES has the following built-in RDF extractors:

RDF Extractor | Description
tika          | This extractor is based on Apache Tika, collects metadata about the document, and asserts its extracted set of RDF statements to a named graph specific to the document.
text          | Adds an RDF statement with the full text extracted from the document. A side effect of this extractor is that a document’s text will be indexed by the search index twice: once for the document itself, and again for the value of this RDF statement.
entities      | This extractor uses OpenNLP to extract all the mentions of named entities from the document and adds this information to the document named graph.
linker        | This extractor works just like entities, but after it finds a named entity mention in the document, it also finds the entity in the database that best matches that mention.
dictionary    | Similar to linker, but using a user-provided dictionary that maps named entity mentions to IRIs.

The CoreNLPEntityLinkerRDFExtractor, CoreNLPMentionRDFExtractor, and CoreNLPRelationRDFExtractor extractors are available through the bites-corenlp repository. Instructions for their setup are provided in the README of the repository.

See Entity Extraction and Linking for more details about some of these extractors.

Custom RDF Extractors

The included RDF extractors are intentionally basic, especially when compared to machine learning or text mining algorithms. A custom extractor connects the document store to algorithms tailored specifically to your data. The extractor SPI allows integration of any workflow or algorithm, from NLP methods such as part-of-speech tagging, entity recognition, relationship learning, or sentiment analysis to machine learning models such as document ranking and clustering.

Extracted RDF assertions are stored in a named graph specific to the document, allowing provenance tracking and versatile querying. The extractor must implement the RDFExtractor interface. A convenience class, TextProvidingRDFExtractor, extracts the text from the document before calling the extractor. In addition, AbstractEntityRDFExtractor (or one of its existing subclasses) extends TextProvidingRDFExtractor, so you can subclass it to customize entity linking extraction to your specific needs.

The text extractor SPI gives you the opportunity to support arbitrary document formats. Implementations receive a raw document and are expected to extract a string of text, which is added to the full-text search index. Text extractors should implement the TextExtractor interface.

Custom extractors are registered with the Java ServiceLoader under the RDFExtractor or TextExtractor class names. Custom extractors can be referred to from the command line or APIs by their fully qualified or “simple” class names.
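
ServiceLoader registration comes down to a provider-configuration file on the classpath: a file under META-INF/services, named after the fully qualified interface, that lists one implementation class per line. The following is a minimal sketch of generating that file. The package name given for RDFExtractor is an assumption to check against the Javadoc for your Stardog version, and com.example.bites.WordCountExtractor is a hypothetical class used only for illustration.

```python
from pathlib import Path
import tempfile

# Assumption: fully qualified name of the RDFExtractor interface.
# Verify it against the Javadoc for your Stardog version.
RDF_EXTRACTOR_SPI = "com.complexible.stardog.docs.extraction.RDFExtractor"


def write_service_file(resources_dir, spi_name, implementations):
    """Write a ServiceLoader provider-configuration file and return its path."""
    services_dir = Path(resources_dir) / "META-INF" / "services"
    services_dir.mkdir(parents=True, exist_ok=True)
    service_file = services_dir / spi_name
    # One fully qualified implementation class name per line.
    service_file.write_text("\n".join(implementations) + "\n", encoding="utf-8")
    return service_file


# Hypothetical extractor class, written to a temporary directory for the demo.
demo_dir = tempfile.mkdtemp()
demo_file = write_service_file(
    demo_dir, RDF_EXTRACTOR_SPI, ["com.example.bites.WordCountExtractor"]
)
print(demo_file.relative_to(demo_dir).as_posix())
print(demo_file.read_text(encoding="utf-8"), end="")
```

In a real project this file would normally live in the resources directory of the extractor's jar rather than be generated by a script; the sketch only makes the expected layout explicit.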

For an example of a custom RDF extractor, see our GitHub repository. Note that custom RDF extractors can be built on top of built-in or custom entity extractors.

Text Extraction

The document store is fully integrated with Stardog’s Full Text Search. As with RDF extraction, text extraction supports arbitrary file formats, and pluggable extractors are able to retrieve the textual contents of a document for indexing. Once a document is added to BITES, its contents can be searched in the same way as other literals by using the standard textMatch predicate in SPARQL queries.

Managing Documents

CRUD operations on documents can be performed from the command line, Java API, or HTTP API. Please refer to the StardocsConnection API for details of using the document store from Java.

The following is an example session showing how to manage documents from the command line:

We have a document stored in the file whyfp90.pdf, which we will add to the document store.

$ ls -al whyfp90.pdf
-rw-r--r-- 1 user user 200007 Aug 30 09:46 whyfp90.pdf

We add it to the document store and receive the document’s IRI as a return value.

$ stardog doc put myDB whyfp90.pdf
Successfully put document in the document store: tag:stardog:api:docs:myDB:whyfp90.pdf

Adding the same document again will delete all previous extraction results and insert new ones. With the --keep-assertions option, previous assertions are kept and new ones are appended.

$ stardog doc put myDB --keep-assertions -r text whyfp90.pdf
Successfully put document in the document store: tag:stardog:api:docs:myDB:whyfp90.pdf

Alternatively, we can add it with a different name. Repeated calls will update the document and refresh extraction results.

$ stardog doc put myDB --name why-functional-programming-matters.pdf whyfp90.pdf
Successfully put document in the document store: tag:stardog:api:docs:myDB:why-functional-programming-matters.pdf

We can subsequently retrieve documents and store them locally.

$ stardog doc get myDB whyfp90.pdf
Wrote document 'whyfp90.pdf' to file 'whyfp90.pdf'

Local files will not be overwritten.

$ stardog doc get myDB whyfp90.pdf
File 'whyfp90.pdf' already exists. You must remove it or specify a different filename.

How many documents are in the document store?

$ stardog doc count myDB
Count: 2 documents

Removing a document will also clear its named graph and full-text search index entries.

$ stardog doc delete myDB whyfp90.pdf
Successfully executed deletion.

Re-indexing the document store allows us to apply a different RDF or text extractor to all the documents, refreshing extraction results.

$ stardog doc reindex myDB -r entities
"Re-indexed 1 documents"

See the doc command group page for more details about the doc CLI commands.
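
When these commands are driven from a script, it can help to assemble the argument list in one place. The following is a minimal sketch that builds a doc put invocation using only the options that appear in the session above (--name, --keep-assertions, and -r). The helper name doc_put_command and its parameters are illustrative and not part of Stardog.

```python
import shlex


def doc_put_command(database, path, name=None, keep_assertions=False,
                    rdf_extractor=None):
    """Return the argv list for a `stardog doc put` call."""
    argv = ["stardog", "doc", "put", database]
    if name is not None:
        # Store the document under a different name than the local file.
        argv += ["--name", name]
    if keep_assertions:
        # Keep previously extracted assertions instead of replacing them.
        argv.append("--keep-assertions")
    if rdf_extractor is not None:
        argv += ["-r", rdf_extractor]
    argv.append(path)
    return argv


cmd = doc_put_command("myDB", "whyfp90.pdf",
                      keep_assertions=True, rdf_extractor="text")
print(shlex.join(cmd))
# prints: stardog doc put myDB --keep-assertions -r text whyfp90.pdf
```

Passing the resulting list to subprocess.run(cmd, check=True) would run it against a live server; keeping the arguments as a list avoids shell quoting problems with unusual file names.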

Named Graphs and Document Queries

Documents in BITES are identified by IRI. As shown in the command line examples above, the IRI is returned from a document put call. The IRI is a combination of a prefix, the database name, and the document name. The CLI uses the document name to refer to the documents. The RDF index, and therefore SPARQL queries, use the IRIs to refer to the documents. RDF assertions extracted from a document are placed into a named graph identified by the document’s IRI.

Here we can see the results of querying a document’s named graph when using the default metadata extractor:

$ stardog query execute myDB "select ?p ?o { graph <tag:stardog:api:docs:myDB:whyfp90.pdf> { ?s ?p ?o } }"
+--------------------------------------------+--------------------------------------+
|                     p                      |                  o                   |
+--------------------------------------------+--------------------------------------+
| rdf:type                                   | http://xmlns.com/foaf/0.1/Document   |
| rdf:type                                   | tag:stardog:api:docs:Document        |
| tag:stardog:api:docs:fileSize              | 200007                               |
| http://purl.org/dc/elements/1.1/identifier | "whyfp90.pdf"                        |
| rdfs:label                                 | "whyfp90.pdf"                        |
| http://ns.adobe.com/pdf/1.3/PDFVersion     | "1.3"                                |
| http://ns.adobe.com/xap/1.0/CreatorTool    | "TeX"                                |
| http://ns.adobe.com/xap/1.0/t/pg/NPages    | 23                                   |
| http://purl.org/dc/terms/created           | "2006-05-19T13:42:00Z"^^xsd:dateTime |
| http://purl.org/dc/elements/1.1/format     | "application/pdf; version=1.3"       |
| http://ns.adobe.com/pdf/1.3/encrypted      | "false"                              |
+--------------------------------------------+--------------------------------------+

Query returned 11 results in 00:00:00.045
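
The mapping from a document name to its IRI, and from there to a named-graph query, can be sketched as follows. The prefix tag:stardog:api:docs: is inferred from the IRIs printed in the examples above, and the sketch assumes that the document name needs no additional encoding; the function names are illustrative.

```python
# Prefix inferred from the IRIs shown in the CLI examples (an assumption).
DOC_IRI_PREFIX = "tag:stardog:api:docs:"


def document_iri(database, document_name):
    """Combine the prefix, the database name, and the document name."""
    return f"{DOC_IRI_PREFIX}{database}:{document_name}"


def named_graph_query(database, document_name):
    """Build a query over the named graph that holds a document's assertions."""
    iri = document_iri(database, document_name)
    return f"select ?p ?o {{ graph <{iri}> {{ ?s ?p ?o }} }}"


print(document_iri("myDB", "whyfp90.pdf"))
print(named_graph_query("myDB", "whyfp90.pdf"))
```

The second line of output is the same query string passed to stardog query execute in the example above.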