
Full-text Search

This page discusses Stardog’s full-text search capabilities.

Page Contents
  1. Overview
  2. Indexing Strategy
  3. Enabling Search
    1. Using Java
  4. Integration with SPARQL
    1. Service Form of Search
    2. Searching over Variable Bindings
    3. Highlighting Relevant Fragments of Search Results
    4. Getting Explanation of Search Result Scores
  5. Customization of Indexing
    1. Data Types to be Indexed
    2. Exclusion List of Properties
    3. Tokenizer
    4. Stop Words
    5. Applying Customization to Existing Databases
  6. Search Syntax
    1. Escaping Characters in Search

Overview

Stardog’s built-in full-text search system indexes data stored in Stardog for information retrieval queries. Search is not supported over data in virtual sources.

Indexing Strategy

The indexing strategy creates a “search document” per RDF literal. Each document consists of two fields: literal ID and literal value. See Custom Analyzer for details on customizing Stardog’s search programmatically.
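
As a rough sketch, inserting two literals creates two search documents (the prefix, class, and property names here are purely illustrative):

prefix : <http://example.org/>

INSERT DATA {
  :doc1 a :Article ; :title "Mexico City travel guide" .
  :doc2 a :Article ; :title "A history of Mexico" .
}

# Conceptually, the search index now holds one document per literal:
#   document 1: literal ID of "Mexico City travel guide", literal value "Mexico City travel guide"
#   document 2: literal ID of "A history of Mexico", literal value "A history of Mexico"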

Enabling Search

Full-text support for a database is disabled by default but can be enabled at any time by setting the configuration option search.enabled to true. For example, you can create a database with full-text support as follows:

$ stardog-admin db create -o search.enabled=true -n myDb

Using Java

Similarly, you can set the option using SearchOptions.SEARCHABLE when creating the database programmatically:

// Create a database with full-text index
dbms.newDatabase("waldoTest")
    .set(SearchOptions.SEARCHABLE, true)
    .create();

Integration with SPARQL

We use the predicate tag:stardog:api:property:textMatch (or http://jena.hpl.hp.com/ARQ/property#textMatch) to access the search index in a SPARQL query.

The textMatch function takes one required argument, the search query in Lucene syntax, and by default returns all literals matching the query string. For example:

SELECT DISTINCT ?s ?score
WHERE {
  ?s ?p ?l.
  (?l ?score) <tag:stardog:api:property:textMatch> 'mac'.
}

This query selects all literals which match ‘mac’. These literals are then joined with the generic BGP ?s ?p ?l to get the resources (?s) that have those literals. Alternatively, you could use ?s rdf:type ex:Book if you only wanted to select the books that match the search criteria; you can include as many other BGPs as you like to refine your initial search results.
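
For instance, a sketch of the books-only variant mentioned above, assuming the data uses an ex:Book class:

prefix ex: <http://example.org/>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT DISTINCT ?s ?score
WHERE {
  ?s rdf:type ex:Book .
  ?s ?p ?l .
  (?l ?score) <tag:stardog:api:property:textMatch> 'mac'.
}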

You can change the number of results textMatch returns by providing an optional second argument with the limit:

SELECT DISTINCT ?s ?score
WHERE {
  ?s ?p ?l.
  (?l ?score) <tag:stardog:api:property:textMatch> ('mac' 100).
}

The limit in textMatch only limits the number of literals returned, which is different from the total number of results the query will return. When a LIMIT is specified in the SPARQL query, it does not affect the full-text search; rather, it only restricts the size of the final result set.
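
For instance, the following sketch combines both: textMatch retrieves up to 100 matching literals from the index, while the SPARQL LIMIT restricts the final result set to 10 rows:

SELECT DISTINCT ?s ?score
WHERE {
  ?s ?p ?l.
  (?l ?score) <tag:stardog:api:property:textMatch> ('mac' 100).
}
LIMIT 10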

Lucene returns a score with each match. It is possible to return these scores and define filters based on the score:

SELECT DISTINCT ?s ?score
WHERE {
  ?s ?p ?l.
  (?l ?score) <tag:stardog:api:property:textMatch> ('mac' 0.5 10).
}

This query returns 10 matching literals whose score is greater than 0.5. Note that, as explained in the Lucene documentation, scoring is highly dependent on the way documents are indexed, and the range of scores may differ significantly between databases.

Service Form of Search

The textMatch predicate is concise for simple queries, but with up to four input constants and two or more output variables, its positional arguments can become confusing. An alternate syntax based on the SPARQL SERVICE clause is therefore provided. Not only does it make the arguments explicit, it also provides additional features, such as the ability to search over variable bindings and to return highlighted fragments, both described below.

With the SERVICE clause syntax, we specify each parameter by name. Here’s an example using a number of different parameters:

prefix fts: <tag:stardog:api:search:>

SELECT * WHERE {
  service fts:textMatch {
      [] fts:query 'Mexico AND city' ;
         fts:threshold 0.6 ;
         fts:limit 10 ;
         fts:offset 5 ;
         fts:score ?score ;
         fts:result ?res ;
  }
}

Searching over Variable Bindings

Search queries aren’t always as simple as a single constant query. It’s possible to perform multiple search queries using other bindings in the SPARQL query as input. This can be accomplished by specifying a variable for the fts:query parameter. In the following example, we use the titles of new books to find related books:

prefix fts: <tag:stardog:api:search:>

SELECT * WHERE {
  # Find new books and their titles. Each title will be used as input to a
  # search query in the full-text index
  ?newBook a :NewBook ; :title ?title .
  service fts:textMatch {
      [] fts:query ?title ;
         fts:score ?score ;
         fts:result ?relatedText ;
  }
  # Bindings of ?relatedText will be used to look up other books in the database
  ?relatedBook :title ?relatedText .
  filter(?newBook != ?relatedBook)
}

Highlighting Relevant Fragments of Search Results

When building search engines, it’s essential not only to find the most relevant results but also to display them in a way that helps users pick the entry most relevant to them. To this end, Stardog provides a highlight argument in the SERVICE clause search syntax. When this argument is given an otherwise unbound variable, the result will include one or more fragments of the matched string literal that contain the search terms. The highlightMaxPassages parameter can be used to limit the maximum number of fragments included in the highlight result.

To illustrate, an example query and its results are given below.

prefix fts: <tag:stardog:api:search:>
SELECT * WHERE {
  service fts:textMatch {
      [] fts:query "mexico AND city" ;
         fts:score ?score ;
         fts:result ?result ;
         fts:highlight ?highlight
  }
}
order by desc(?score)
limit 4

The results might include highlighted fragments such as:

a <b>city</b> in south central <b>Mexico</b> (southeast of <b>Mexico</b>
<b>City</b>) on the edge of central Mexican plateau
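
To cap the number of returned fragments, the same query can also pass the highlightMaxPassages parameter. The prefixed name fts:highlightMaxPassages used below is an assumption, following the naming of the other parameters in the fts: namespace:

prefix fts: <tag:stardog:api:search:>
SELECT * WHERE {
  service fts:textMatch {
      [] fts:query "mexico AND city" ;
         fts:score ?score ;
         fts:result ?result ;
         fts:highlight ?highlight ;
         # assumed spelling of the highlightMaxPassages parameter
         fts:highlightMaxPassages 1
  }
}
order by desc(?score)
limit 4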

Getting Explanation of Search Result Scores

Search engines produce a set of results ranked by relevance scores between the query and the document hits, which allows applications to serve the most relevant results first. By default, the BM25Similarity algorithm is used to compute these scores. To understand how a score was computed, Stardog can supply an explanation: add an fts:explanation argument to the SERVICE clause.

A sample query can be written as follows:

prefix fts: <tag:stardog:api:search:>
SELECT * WHERE {
  service fts:textMatch {
      [] fts:query "mexico AND city" ;
         fts:score ?score ;
         fts:result ?result ;
         fts:explanation ?explanation ;
  }
}
order by desc(?score)
limit 4

To demonstrate explanation output from this query, let’s examine the top hit’s explanation:

8.041948 = sum of:
    4.54086 = weight(value:mexico in 98) [BM25Similarity], 
    result of:
        4.54086 = score(doc=98,freq=1.0 = termFreq=1.0), 
        product of:
            4.54086 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) 
            from:
                17.0 = docFreq
                1640.0 = docCount
            1.0 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1) 
            from:
                1.0 = termFreq=1.0
                1.2 = parameter k1
                0.0 = parameter b (norms omitted for field)
    3.5010884 = weight(value:city in 98) [BM25Similarity], 
    result of:
        3.5010884 = score(doc=98,freq=1.0 = termFreq=1.0), 
        product of:
            3.5010884 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) 
            from:
                49.0 = docFreq
                1640.0 = docCount
            1.0 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1) 
            from:
                1.0 = termFreq=1.0
                1.2 = parameter k1
                0.0 = parameter b (norms omitted for field)

The weight of each query token is computed from its IDF (a measure of how informative the token is) and its normalized term frequency (tfNorm); the per-token weights are then summed to produce the hit’s score. The explanation above shows all the details and formulas used to calculate the score for this hit.

Note that the b parameter has no effect here because the search.index.compute.norm option defaults to false, so length normalization is neither computed nor applied in BM25. To compute the norms, set search.index.compute.norm to true. For more details, check here.
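
As a sanity check, the top hit’s score can be reproduced by hand from the numbers in the explanation above (with b = 0, the tfNorm factor reduces to 1; log denotes the natural logarithm):

idf(mexico) = log(1 + (1640 - 17 + 0.5) / (17 + 0.5)) ≈ 4.5409
idf(city)   = log(1 + (1640 - 49 + 0.5) / (49 + 0.5)) ≈ 3.5011
tfNorm      = (1 * (1.2 + 1)) / (1 + 1.2) = 1.0
score       = 4.5409 * 1.0 + 3.5011 * 1.0 ≈ 8.0419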

Customization of Indexing

Full-text search needs vary. Stardog supports customized indexing and searching programmatically via a Custom Analyzer, or through configuration options: choosing which data types are indexed, excluding certain properties, tokenizing with specific splitters, and defining stop words. These options are described below.

Data Types to be Indexed

Only literals whose data types are registered for indexing are indexed, and thus can be searched. The default types are literals with xsd:string and rdf:langString types.

The set of indexed data types can be changed, if needed, with the search.index.datatypes option.

From CLI

$ stardog-admin db create -o search.enabled=true search.index.datatypes=urn:String,urn:Date -n myDb

Using Java

// Create a database with full-text index with specific data types

List<IRI> dataTypeList = Lists.newArrayList(
        Datatype.STRING.iri(), 
        Datatype.DATE.iri(), 
        Datatype.DATETIME.iri());

dbms.newDatabase("waldoTest")
        .set(SearchOptions.SEARCHABLE, true)
        .set(SearchOptions.INDEX_DATATYPES, dataTypeList)
        .create();

Exclusion List of Properties

In some cases, literals attached to certain properties should not be indexed. For instance, literals attached to an Address property may be unnecessary to index. The search.index.properties.excluded option covers that purpose.

None of the properties are excluded by default.

From CLI

$ stardog-admin db create -o search.enabled=true search.index.properties.excluded=urn:Address -n myDb

Using Java

// Create a database with full-text index excluding specific properties

IRI Address = new IRIImpl("http://www.w3.org/2006/vcard/ns#Address");

dbms.newDatabase("waldoTest")
        .set(SearchOptions.SEARCHABLE, true)
        .set(SearchOptions.INDEX_PROPERTIES_EXCLUDED, 
             Lists.newArrayList(Address))
        .create();

Tokenizer

By default, Stardog uses Apache Lucene’s StandardAnalyzer, which relies on StandardTokenizer to produce the terms indexed from free text. However, customizing the tokenization behavior can be beneficial in some domains (e.g. biologists may not want the tokenizer to break at dashes, so that domain-specific entity names stay as a single token). For this purpose, a string of characters can be supplied via the search.index.wordbreak.chars option to be used as token separators when producing the tokens to be indexed. When this option is set, a tokenizer extending Lucene’s CharTokenizer is created with the customized token separators, with whitespace always added as a separator.

Note that from the CLI some characters need to be escaped due to the option parser.

From CLI

$ stardog-admin db create -o search.enabled=true search.index.wordbreak.chars=\(\)\[\]! -n myDb

Using Java

// Create a database with full-text index with a customized tokenizer

dbms.newDatabase("waldoTest")
        .set(SearchOptions.SEARCHABLE, true)
        .set(SearchOptions.WORD_BREAK_CHARS, "()[]")
        .create();

Stop Words

Stop words are terms that are omitted when indexing literal content and analyzing queries. By default, Stardog uses Apache Lucene’s default ENGLISH_STOP_WORDS_SET:

a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with

However, these terms may be significant in some domains, so the stop word list can be customized via the search.index.stopwords option. Consider indexing international country codes: the codes AS, AT, BE, BY, IN, IS, IT, NO, and TO overlap with terms in the default set, so they would neither be indexed nor show up in search results. In such cases, the stop word list should be altered.

From CLI

$ stardog-admin db create -o search.enabled=true search.index.stopwords=a,an,that,the -n myDb

Using Java

// Create a database with full-text index with customized stop words

dbms.newDatabase("waldoTest")
        .set(SearchOptions.SEARCHABLE, true)
        .set(SearchOptions.INDEX_STOPWORDS, 
             Lists.newArrayList("a", "an", "that", "the"))
        .create();

Applying Customization to Existing Databases

Any of the customizations described above can be applied to an existing database while it is offline. Stardog will bring the search index in line with the changes by reindexing when the database comes back online.

From CLI

$ stardog-admin db create -o search.enabled=true -n myDb
$ stardog-admin db offline myDb
$ stardog-admin metadata set -o search.index.stopwords=a,an,that,the -- myDb
$ stardog-admin db online myDb

Using Java

dbms.offline("waldoTest", 0, TimeUnit.SECONDS);
dbms.set("waldoTest", 
         SearchOptions.INDEX_PROPERTIES_EXCLUDED, 
         Sets.newHashSet(VCard.GEO));
dbms.set("waldoTest", 
        SearchOptions.INDEX_STOPWORDS, 
        Sets.newHashSet(""));
dbms.online("waldoTest");

Search Syntax

Stardog search is based on Lucene 7.4.0: we support all of the search modifiers that Lucene supports, with the exception of fields.

  • wildcards: ? and *
  • fuzzy: ~ with similarity weights (e.g. foo~0.8)
  • proximities: "semantic web"~5
  • term boosting
  • booleans: OR, AND, NOT, +, and -.
  • grouping

For a more detailed discussion, see the Lucene docs.
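
As a sketch, several of these modifiers can be combined in a single textMatch query; the query string itself is plain Lucene syntax:

SELECT DISTINCT ?s ?score
WHERE {
  ?s ?p ?l.
  # wildcard, fuzzy matching, grouping, and boolean operators in one query string
  (?l ?score) <tag:stardog:api:property:textMatch> '(mexic* OR citty~0.8) AND NOT plateau'.
}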

Escaping Characters in Search

The “/” character must be escaped because Lucene says so. In fact, there are several characters that are part of Lucene’s query syntax that must be escaped.
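
For example, to search for a term containing “/”, escape it with a backslash; since the backslash itself must also be escaped inside a SPARQL string literal, a single Lucene-level escape appears as \\ in the query string (a sketch):

SELECT DISTINCT ?s ?score
WHERE {
  ?s ?p ?l.
  # matches literals containing the term "mexico/city"; the "/" is escaped for
  # Lucene, and the escaping backslash is itself escaped in the SPARQL string
  (?l ?score) <tag:stardog:api:property:textMatch> 'mexico\\/city'.
}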