Full-text Search

This page discusses Stardog’s full-text search capabilities, including lexical search and semantic search.

Page Contents
  1. Overview
  2. Indexing Strategy
  3. Enabling Search
  4. Performance concerns for semantic search
  5. Integration with SPARQL
    1. Service Form of Search
    2. Searching over Variable Bindings
    3. Highlighting Relevant Fragments of Search Results
    4. Getting Explanation of Search Result Scores
  6. Customization of Indexing
    1. Data Types to be Indexed
    2. Exclusion List of Properties
    3. Inclusion List of Properties
    4. Inclusion or Exclusion List of Contexts
    5. Tokenizer
    6. Stop Words
    7. Applying Customization to Existing Databases
    8. Using Search with No Stop Words
  7. Search Syntax
    1. Query Samples
    2. Escaping Characters in Search
    3. Escaping Characters over Variable Bindings
  8. Performance Hints

Overview

Stardog’s built-in full-text search system indexes data stored in Stardog for information retrieval queries. Search is not supported over data in virtual sources.

Search supports two modes: lexical mode, which is based on similarity of word forms, and semantic mode, which compares vector embeddings. Both modes provide a way to find similar natural language texts, but they differ in how they estimate similarity.

Older versions of Stardog included a beta implementation of semantic search powered by an external service. That deployment model proved challenging, so it was superseded by the current implementation, which stores semantic vectors in the same index used for full-text search.

Indexing Strategy

The indexing strategy creates a “search document” per RDF literal. Each document consists of several fields:

  • literal ID
  • literal value, if lexical search is enabled
  • vector embedding, if semantic search is enabled

See Custom Analyzer for details on customizing Stardog’s search programmatically.

Enabling Search

Full-text support for a database is disabled by default but can be enabled at any time by setting the configuration option search.enabled to true. For example, you can create a database with full-text support as follows:

From CLI

$ stardog-admin db create -o search.enabled=true -n myDb

This will enable full-text search in lexical mode. Semantic mode can be enabled with the search.semantic.enabled option:

$ stardog-admin db create -o search.enabled=true search.semantic.enabled=true -n myDb

It is also possible to enable semantic mode and disable lexical mode:

$ stardog-admin db create -o search.enabled=true search.semantic.enabled=true search.lexical.enabled=false -n myDb

Semantic mode requires a language model. By default, a model bundled with Stardog (all-MiniLM-L6-v2) is used when semantic mode is enabled. It can be overridden with the search.semantic.model option:

$ stardog-admin db create -o search.enabled=true search.semantic.enabled=true search.semantic.model="djl://ai.djl.huggingface.pytorch/sentence-transformers/paraphrase-albert-small-v2" -n myDb

Using Java

Similarly, you can set the SearchOptions.SEARCHABLE, SearchOptions.SEARCH_SEMANTIC, SearchOptions.SEARCH_SEMANTIC_MODEL and SearchOptions.SEARCH_LEXICAL options when creating the database programmatically:

// Create databases with different full-text index options
dbms.newDatabase("waldoTestLexicalOnly")
        .set(SearchOptions.SEARCHABLE, true)
        .create();
dbms.newDatabase("waldoTestLexicalAndSemantic")
        .set(SearchOptions.SEARCHABLE, true)
        .set(SearchOptions.SEARCH_SEMANTIC, true)
        .create();
dbms.newDatabase("waldoTestLexicalAndSemanticCustomModel")
        .set(SearchOptions.SEARCHABLE, true)
        .set(SearchOptions.SEARCH_SEMANTIC, true)
        .set(SearchOptions.SEARCH_SEMANTIC_MODEL, "djl://ai.djl.huggingface.pytorch/sentence-transformers/paraphrase-albert-small-v2")
        .create();
dbms.newDatabase("waldoTestSemanticOnly")
        .set(SearchOptions.SEARCHABLE, true)
        .set(SearchOptions.SEARCH_SEMANTIC, true)
        .set(SearchOptions.SEARCH_LEXICAL, false)
        .create();

Performance concerns for semantic search

Building semantic vectors is a compute-intensive process that benefits from GPU acceleration. Indexing with semantic search enabled can be very slow on machines that do not have GPUs. Another way to improve indexing performance is to limit the number of literals to index using context filters.

Memory usage is also affected. Vector embeddings for semantic search reside in the Java heap until they are written to the search index, which can incur a noticeable overhead compared to indexing in lexical mode only. The lower bound of this overhead can be estimated as the number of literals times the number of language model dimensions times the vector element size (4 bytes). The default language model used by Stardog has 384 dimensions, so indexing one million literals requires at least 1,000,000 × 384 × 4 bytes ≈ 1.5 GB of heap for the embeddings alone.

Integration with SPARQL

We use the predicate tag:stardog:api:property:textMatch (or http://jena.hpl.hp.com/ARQ/property#textMatch) to access the search index in a SPARQL query. Note that this supports only lexical search (semantic search requires the Service Form).

The textMatch function has one required argument, the search query in Lucene Syntax, and by default it returns all literals matching the query string. For example,

SELECT DISTINCT ?s ?score
WHERE {
  ?s ?p ?l.
  (?l ?score) <tag:stardog:api:property:textMatch> 'mac'.
}

This query selects all literals which match ‘mac’. These literals are then joined with the generic BGP ?s ?p ?l to get the resources (?s) that have those literals. Alternatively, you could use ?s rdf:type ex:Book if you only wanted to select the books which reference the search criteria; you can include as many other BGPs as you like to enhance your initial search results.

You can change the number of results textMatch returns by providing an optional second argument with the limit:

SELECT DISTINCT ?s ?score
WHERE {
  ?s ?p ?l.
  (?l ?score) <tag:stardog:api:property:textMatch> ('mac' 100).
}

The limit in textMatch only restricts the number of literals returned, which is different from the total number of results the query will return. When a LIMIT is specified in the SPARQL query, it does not affect the full-text search; rather, it only restricts the size of the final result set.
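
For example, the following query (a sketch over the same data) searches up to 100 matching literals, while the SPARQL LIMIT caps the final result set at 10 rows:

SELECT DISTINCT ?s ?score
WHERE {
  ?s ?p ?l.
  (?l ?score) <tag:stardog:api:property:textMatch> ('mac' 100).
}
LIMIT 10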

Lucene returns a score with each match. It is possible to return these scores and define filters based on the score:

SELECT DISTINCT ?s ?score
WHERE {
  ?s ?p ?l.
  (?l ?score) <tag:stardog:api:property:textMatch> ('mac' 0.5 10).
}

This query returns 10 matching literals whose score is greater than 0.5. Note that, as explained in the Lucene documentation, scoring is very much dependent on the way documents are indexed, and the range of scores might change significantly between different databases.

Service Form of Search

The textMatch predicate is concise for simple queries, but with up to four input constants and two or more output variables, positional arguments can become confusing. An alternative syntax based on the SPARQL SERVICE clause is provided. Not only does it make the arguments explicit, it also provides additional features, such as the ability to search over variable bindings, return highlighted fragments of results, or get explanations of search result scores.

With the SERVICE clause syntax, each parameter is specified by name, as listed in the table below:

Parameter Name Description
query string to query over a search index
result results received from the search index for a query
score calculated score between a query and a hit result
threshold minimum score; only results with a score greater than or equal to this value are included
limit maximum number of hit results to return
offset number of initial hit results to skip before returning matches
highlight fragment of a hit result’s content where the query term(s) match
explanation explanation of how the hit result’s score is calculated
parsedQuery parsed version of the query that Lucene query parser creates
highlightMaxPassages maximum number of highlighted passages when highlight parameter is in use
semantic search in semantic mode

Here’s an example using a number of different parameters:

prefix fts: <tag:stardog:api:search:>

SELECT * WHERE {
  service fts:textMatch {
      [] fts:query 'Mexico AND city' ;
         fts:threshold 0.6 ;
         fts:limit 10 ;
         fts:offset 5 ;
         fts:score ?score ;
         fts:result ?res ;
  }
}

For semantic search, the database must have been created with semantic mode support, and the query needs to set the semantic parameter to true:

prefix fts: <tag:stardog:api:search:>

SELECT * WHERE {
  service fts:textMatch {
      [] fts:query 'town' ;
         fts:threshold 0.8 ;
         fts:limit 10 ;
         fts:offset 5 ;
         fts:score ?score ;
         fts:result ?res ;
         fts:semantic true ;
  }
}

Searching over Variable Bindings

Search queries aren’t always as simple as a single constant query. It’s possible to perform multiple search queries using other bindings in the SPARQL query as input. This can be accomplished by specifying a variable for the fts:query parameter. In the following example, we use the titles of new books to find related books:

prefix fts: <tag:stardog:api:search:>

SELECT * WHERE {
  # Find new books and their titles. Each title will be used as input to a
  # search query in the full-text index
  ?newBook a :NewBook ; :title ?title .
  service fts:textMatch {
      [] fts:query ?title ;
         fts:score ?score ;
         fts:result ?relatedText ;
  }
  # Bindings of ?relatedText will be used to look up other books in the database
  ?relatedBook :title ?relatedText .
  filter(?newBook != ?relatedBook)
}

Highlighting Relevant Fragments of Search Results

When building search engines, it’s essential not only to find the most relevant results but also to display them in a way that helps users select the entry most relevant to them. To this end, Stardog provides a highlight argument in the SERVICE clause search syntax. When this argument is given an otherwise unbound variable, the result will include one or more fragments of the matched string literal that contain the search terms. The highlightMaxPassages parameter can be used to limit the maximum number of fragments included in the highlight result.

To illustrate, here is an example query and its results.

prefix fts: <tag:stardog:api:search:>
SELECT * WHERE {
  service fts:textMatch {
      [] fts:query "mexico AND city" ;
         fts:score ?score ;
         fts:result ?result ;
         fts:highlight ?highlight
  }
}
order by desc(?score)
limit 4

The results might include highlighted fragments such as:

a <b>city</b> in south central <b>Mexico</b> (southeast of <b>Mexico</b>
<b>City</b>) on the edge of central Mexican plateau

Note: highlighting works only with lexical search.

Getting Explanation of Search Result Scores

Search engines produce a set of results ranked by relevance scores between the query and the document hits, which allows applications to serve the most relevant results first. By default, the BM25Similarity algorithm is used to compute these scores. To understand how the scores are computed, Stardog can supply an explanation; to request one, add the fts:explanation argument to the SERVICE clause.

A sample query can be written as follows:

prefix fts: <tag:stardog:api:search:>
SELECT * WHERE {
  service fts:textMatch {
      [] fts:query "mexico AND city" ;
         fts:score ?score ;
         fts:result ?result ;
         fts:explanation ?explanation ;
  }
}
order by desc(?score)
limit 4

To demonstrate explanation output from this query, let’s examine the top hit’s explanation:

8.041948 = sum of:
    4.54086 = weight(value:mexico in 98) [BM25Similarity], 
    result of:
        4.54086 = score(doc=98,freq=1.0 = termFreq=1.0), 
        product of:
            4.54086 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) 
            from:
                17.0 = docFreq
                1640.0 = docCount
            1.0 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1) 
            from:
                1.0 = termFreq=1.0
                1.2 = parameter k1
                0.0 = parameter b (norms omitted for field)
    3.5010884 = weight(value:city in 98) [BM25Similarity], 
    result of:
        3.5010884 = score(doc=98,freq=1.0 = termFreq=1.0), 
        product of:
            3.5010884 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) 
            from:
                49.0 = docFreq
                1640.0 = docCount
            1.0 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1) 
            from:
                1.0 = termFreq=1.0
                1.2 = parameter k1
                0.0 = parameter b (norms omitted for field)

The score of a hit is the sum of per-token weights computed by BM25: for each query token, the weight is the product of its IDF (a measure of how informative the token is) and its normalized term frequency (tfNorm). The explanation above shows the formulas and intermediate values used to compute the score for this hit.

Note that the b parameter has no effect here because the search.index.compute.norm option defaults to false, so length normalization is neither computed nor applied in BM25. To compute the norms, set search.index.compute.norm to true. For more details, check here.
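
For reference, the per-token weights in the explanation above follow the BM25 formulas printed in the output; a compact restatement in math notation (with length normalization omitted, since search.index.compute.norm is false):

\[
\mathrm{score}(q,d) = \sum_{t \in q} \mathrm{idf}(t)\cdot\mathrm{tfNorm}(t,d)
\]
\[
\mathrm{idf}(t) = \log\!\left(1 + \frac{\mathrm{docCount} - \mathrm{docFreq} + 0.5}{\mathrm{docFreq} + 0.5}\right),
\qquad
\mathrm{tfNorm}(t,d) = \frac{\mathrm{freq}\cdot(k_1 + 1)}{\mathrm{freq} + k_1}
\]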

Customization of Indexing

Full-text indexing and search can be customized in several ways. Stardog supports programmatic customization via a Custom Analyzer, as well as configuration options that control which data types are indexed, which properties and contexts are included or excluded, how text is tokenized, and which stop words are used. These options are described below.

Data Types to be Indexed

Only literals whose data types are registered for indexing are indexed, and thus searchable. The default types are xsd:string and rdf:langString.

The set of indexed data types can be changed with the search.index.datatypes option if needed.

From CLI

$ stardog-admin db create -o search.enabled=true search.index.datatypes=urn:String,urn:Date -n myDb

Using Java

// Create a database with full-text index with specific data types

List<IRI> dataTypeList = Lists.newArrayList(
        Datatype.STRING.iri(), 
        Datatype.DATE.iri(), 
        Datatype.DATETIME.iri());

dbms.newDatabase("waldoTest")
        .set(SearchOptions.SEARCHABLE, true)
        .set(SearchOptions.INDEX_DATATYPES, dataTypeList)
        .create();

Exclusion List of Properties

In some cases it might be unnecessary to index literals used with certain properties. For example, we might not need to index phone numbers if search queries are not expected to match them. The search.index.properties.excluded option can be used in this case.

No properties are excluded by default.

From CLI

$ stardog-admin db create -o search.enabled=true search.index.properties.excluded=http:\/\/example.com\#hasPhone -n myDb

Multiple properties can be provided as a comma-separated list:

$ stardog-admin db create -o search.enabled=true search.index.properties.excluded=http:\/\/example.com\#hasPhone,http:\/\/example.com\#hasFax -n myDb

Using Java

// Create a database with full-text index excluding specific properties

IRI hasPhone = Values.iri("http://example.com#hasPhone");

dbms.newDatabase("waldoTest")
        .set(SearchOptions.SEARCHABLE, true)
        .set(SearchOptions.INDEX_PROPERTIES_EXCLUDED, 
		    Sets.newHashSet(hasPhone))
        .create();

Inclusion List of Properties

To index only literals used with specific properties, use the search.index.properties.included option. For example, we can specify that only rdfs:label values should be included in the search index; literals used with any other property will then be excluded from it.

No properties are included in this list by default.

Note that if both the search.index.properties.excluded and search.index.properties.included options are set, only the search.index.properties.excluded option will be used.

From CLI

$ stardog-admin db create -o search.enabled=true search.index.properties.included=http:\/\/www.w3.org\/2000\/01\/rdf-schema\#label -n myDb

Multiple properties can be provided as a comma-separated list:

$ stardog-admin db create -o search.enabled=true search.index.properties.included=http:\/\/www.w3.org\/2000\/01\/rdf-schema\#label,http:\/\/www.w3.org\/2000\/01\/rdf-schema\#comment -n myDb

Using Java

// Create a database with full-text index including specific properties

dbms.newDatabase("waldoTest")
        .set(SearchOptions.SEARCHABLE, true)
        .set(SearchOptions.INDEX_PROPERTIES_INCLUDED, 
		    Sets.newHashSet(RDFS.LABEL))
        .create();

Inclusion or Exclusion List of Contexts

It can also be convenient to index literals only within, or outside of, certain contexts. Excluding or including certain contexts is possible via the search.index.contexts.filter and search.index.contexts.excluded options.

The search.index.contexts.filter option is interpreted as either an exclusion or an inclusion list depending on the search.index.contexts.excluded boolean option: if that option is true (the default), the list is an exclusion list; otherwise it is an inclusion list.

No contexts are excluded or included by default.

From CLI

$ stardog-admin db create -o search.enabled=true search.index.contexts.filter=http:\/\/www.w3.org\/2006\/vcard\/ns\#Address,http:\/\/www.w3.org\/2006\/vcard\/ns\#postal-code search.index.contexts.excluded=false -n myDb

Note that special characters in the context IRIs need to be escaped.

Using Java

// Create a database with full-text index only for the literals with #Address context
IRI address = Values.iri("http://www.w3.org/2006/vcard/ns#Address");
IRI postalCode = Values.iri("http://www.w3.org/2006/vcard/ns#postal-code");
dbms.newDatabase("waldoTest")
        .set(SearchOptions.SEARCHABLE, true)
        .set(SearchOptions.SEARCH_CONTEXTS_EXCLUDED, false)
        .set(SearchOptions.SEARCH_CONTEXTS_FILTER,
		    Sets.newHashSet(address, postalCode))
        .create();

It is also possible to include or exclude certain contexts specifically for semantic indexing using the search.semantic.index.contexts.filter option.
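
For example, a sketch of a CLI invocation following the pattern of the earlier examples (the context IRI here is illustrative):

$ stardog-admin db create -o search.enabled=true search.semantic.enabled=true search.semantic.index.contexts.filter=http:\/\/example.com\#graph1 -n myDb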

Tokenizer

By default, Stardog uses Apache Lucene’s StandardAnalyzer, which relies on StandardTokenizer to produce the terms indexed from free text. However, customizing tokenization can be beneficial for some domains (e.g. biologists may not want the tokenizer to break at dashes, so that domain-specific entity names remain a single token). For this purpose, the search.index.wordbreak.chars option takes a string of characters to be used as token separators. When this option is set, a tokenizer extending Lucene’s CharTokenizer is created that splits on the specified characters plus whitespace.

Note that from the CLI some characters need to be escaped due to the option parser.

From CLI

$ stardog-admin db create -o search.enabled=true search.index.wordbreak.chars=\(\)\[\]! -n myDb

Using Java

// Create a database with full-text index with a customized tokenizer

dbms.newDatabase("waldoTest")
        .set(SearchOptions.SEARCHABLE, true)
        .set(SearchOptions.WORD_BREAK_CHARS, "()[]")
        .create();

Stop Words

Stop words are terms that are omitted when indexing literal content and when analyzing queries. By default, Stardog uses Apache Lucene’s default ENGLISH_STOP_WORDS_SET:

a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with

However, these terms may be meaningful literal content in some domains, so the stop word list can be customized via the search.index.stopwords option. Consider indexing international country codes: AS, AT, BE, BY, IN, IS, IT, NO, and TO overlap with terms in the default set and would therefore be neither indexed nor returned by searches. In such cases, the stop word list should be altered.

From CLI

$ stardog-admin db create -o search.enabled=true search.index.stopwords=a,an,that,the -n myDb

Using Java

// Create a database with full-text index with customized stop words

dbms.newDatabase("waldoTest")
        .set(SearchOptions.SEARCHABLE, true)
        .set(SearchOptions.INDEX_STOPWORDS, 
             Lists.newArrayList("a", "an", "that", "the"))
        .create();

Applying Customization to Existing Databases

Any of the customizations described above can be applied to an existing database while it is offline. Stardog will reindex when the database comes back online, making the search index consistent with the new settings.

From CLI

$ stardog-admin db create -o search.enabled=true -n myDb
$ stardog-admin db offline myDb
$ stardog-admin metadata set -o search.index.stopwords=a,an,that,the -- myDb
$ stardog-admin db online myDb

Using Java

dbms.offline("waldoTest", 0, TimeUnit.SECONDS);
dbms.set("waldoTest", 
         SearchOptions.INDEX_PROPERTIES_EXCLUDED, 
         Sets.newHashSet(VCard.GEO));
dbms.set("waldoTest", 
        SearchOptions.INDEX_STOPWORDS, 
        Sets.newHashSet(""));
dbms.online("waldoTest");

Using Search with No Stop Words

Search can be set to use no stop words using the CLI and through Java.

From CLI

$ stardog-admin db offline dbName
$ stardog-admin metadata set -o search.index.stopwords={} -- dbName
$ stardog-admin db online dbName

Using Java

dbms.offline("waldoTest", 0, TimeUnit.SECONDS);
dbms.set("waldoTest", SearchOptions.INDEX_STOPWORDS, Lists.newArrayList(""));
dbms.online("waldoTest");

Search Syntax

Stardog search is based on Lucene 7.4.0: we support all of the search modifiers that Lucene supports.

The table below lists each special character used in Lucene’s query parser syntax along with its usage:

Character Usage Description
+ boolean operator indicates that the term beginning with this character must occur
- boolean operator indicates that the term beginning with this character must not occur
& boolean operator the && expression is an AND operator for terms, indicating that every term must occur; equivalent to AND
| boolean operator the || expression is an OR operator for terms, indicating that any of the terms may occur; equivalent to OR
! boolean operator a negation operator for terms, indicating that a term must not occur; equivalent to NOT
^ boosting a term increases weight of a term in a query in order to get more relevant results to that specific term
\ escape char escapes a special character to literally index it
: field query separator between the field name and the term
~ fuzzy query makes fuzzy search available for terms with distance algorithms
~ proximity query indicates a term-distance restriction is applied in a phrase query
( grouping start expression of a grouping
) grouping end expression of a grouping
" phrase query surrounds multiple terms to construct a phrase query
{ range query start expression of an exclusive range query
} range query end expression of an exclusive range query
[ range query start expression of an inclusive range query
] range query end expression of an inclusive range query
/ regexp queries start and end character that regex patterns are put between to match a term
* wildcard query wildcard character to match zero or more characters
? wildcard query wildcard character to match any single character

Query Samples

Let’s construct some tag:stardog:api:property:textMatch queries to demonstrate the different types of Lucene queries that can be built with the special characters listed above:

Query String Explanation
+gandalf -gray indicates that gandalf must occur and gray must not with boolean operators
(gandalf AND gray) !(saruman) indicates that gandalf and gray must occur together and saruman must not with boolean operators
value:{Fangorn TO Gandalf} range query sent to a field to retrieve terms lexicographically ranging between two terms
/[0-9]{3}/ regex query to match terms with a 3-digit pattern; note that the regex is matched against individual terms
Elfish~0.8 fuzzy query to match terms within the given distance of the term, e.g. Elvish could match
\"ancient language\"~4 proximity query, constructed as a phrase query, requiring the terms to occur within the given distance of each other
Elvish^4 && Language term boosting on the term Elvish to retrieve hits more relevant to this term
el?ish wildcard query where ? matches any single character in that position, e.g. both elvish and elfish could match
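
Any of these query strings can be passed to textMatch or the fts:textMatch service. A minimal sketch using the first sample (assuming the data contains matching literals):

SELECT DISTINCT ?s ?l ?score
WHERE {
  ?s ?p ?l.
  (?l ?score) <tag:stardog:api:property:textMatch> '+gandalf -gray'.
}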

For a more detailed discussion, see the Lucene docs.

Escaping Characters in Search

If a special character needs to be searched for literally, rather than with the meanings described above, it has to be escaped with a backslash.

In some cases, even escaping characters with a backslash interferes with the query parser, and exceptions such as the following can be produced:

Query String Exception
'09/14' Cannot parse ‘09/14’: Lexical error at line 1, column 6. Encountered: after : "/14"
'09\/14' Lexical error at line 1, column 143. Encountered: “/” (47), after : “'09\”

Hence, a function for escaping specific characters in a string is provided.

The escape(query, escapedCharacters) function escapes the given characters in the query string. The escaped string can be bound to a variable, and that variable can then be passed to the fts:query parameter:

prefix fts: <tag:stardog:api:search:>

SELECT 
    ?l (round(?scoreRaw, 2) as ?score) ?e 
    {bind(escape('09/14', '/') as ?query)
    service fts:textMatch { 
        [] fts:result ?l ; 
           fts:query ?query ; 
           fts:score ?scoreRaw ; 
           fts:explanation ?e ; 
    }
}

Escaping Characters over Variable Bindings

When a variable is specified for the fts:query parameter, the strings bound to that variable may contain special characters that cause parse exceptions or unintentionally change the query type. In those cases, using the escape function makes the search safe:

prefix fts: <tag:stardog:api:search:>

SELECT * WHERE {
  ?album a :Album ; :track/:name ?songName .
  BIND(escape(?songName, '/!') as ?songNameEscaped) .  
  service fts:textMatch {
      [] fts:query ?songNameEscaped ;
         fts:score ?score ;
         fts:result ?otherSong ;
  }
  ?otherAlbum a :Album ;
  :track/:name ?otherSong .
  FILTER(?album != ?otherAlbum)
}

Below are some sample results of the query above, demonstrating escaped characters:

album songName songNameEscaped score otherSong otherAlbum
:The_Woman_in_Me (Shania_Twain_album) “(If You’re Not in It for Love) I’m Outta Here!” “(If You’re Not in It for Love) I’m Outta Here\!” 2.66 “Somebody to Love (Queen song)” :A_Day_at_the_Races (album)
:(Miss)understood “Bold & Delicious/Pride” “Bold & Delicious\/Pride” 7.07 “Mother’s Pride (song)” :Listen_Without_Prejudice Vol._1

Unless escaped, an exception is thrown while executing the query above:

com.complexible.stardog.plan.eval.operator.OperatorException: com.complexible.stardog.search.SearchException: Cannot parse '(If You're Not in It for Love) I'm Outta Here!': Encountered "<EOF>" at line 1, column 46.

Performance Hints

Typically, full-text search queries are selective: the number of literals matching a search query is relatively small, e.g. in the thousands rather than the millions or more. However, in some cases a search query can be non-selective. For instance, the wildcard query a* matches all strings that start with the character ‘a’, which can be a very large number of literals, and looking them all up in the search index would be slow.

A full-text search query is considered non-selective when it contains few bound terms and its estimated cardinality is high. When non-selectivity is detected, an optimization takes place: instead of executing the time-consuming full-text search directly, the more selective patterns are executed first, and the document ids they produce are passed as a filter to the full-text search service, which ends up being much quicker.

The optimizer uses default cardinality thresholds to decide when to apply this optimization, and query hints are available for users to override the defaults.

Suppose we have a SPARQL query containing the following patterns that demonstrates a non-selective full-text search:

...
?product rdfs:label ?label .
?label <tag:stardog:api:property:textMatch> "a*" .
...

By default, the query is planned as shown below, with a join between all the labels matching the full-text search and the pattern that binds the label literals:

`─ HashJoin(?label) [#410K]
   +─ Scan[PSOC](?product, <http://www.w3.org/2000/01/rdf-schema#label>, ?label) [#34K]
   `─ Full-Text(query="/.*?/") -> ?label [#90K]

This plan is generated because, by default, Stardog applies the aforementioned optimization only when the cardinality estimation for the full-text search exceeds a threshold of 1M, and here the estimate is 90K. The optimization can be enabled manually via the search.push.threshold hint, as shown below.

...
where{

    #pragma search.push.threshold 10000
    
    ?product ?prodProperty ?prodObject ;
             dc:publisher ?publisher ;
             rdfs:label ?label .
...

With the hint, the intermediate search results decrease from 90K to 17K, because the document ids retrieved by the previously executed patterns are used as a filter for the full-text search:

`─ MergeJoin(?product) [#17K]
   +─ Scan[PSOC](?product, dc:publisher, ?publisher) [#903K]
   `─ Filter(<tag:stardog:api:search:textMatchFilter>(?label, "/.*?/")) [#17K]
      `─ Scan[PSOC](?product, <http://www.w3.org/2000/01/rdf-schema#label>, ?label) [#34K]