Full-text Search
This page discusses Stardog’s full-text search capabilities.
Overview
Stardog’s built-in full-text search system indexes data stored in Stardog for information retrieval queries. Search is not supported over data in virtual sources.
Indexing Strategy
The indexing strategy creates a “search document” per RDF literal. Each document consists of two fields: literal ID and literal value. See Custom Analyzer for details on customizing Stardog’s search programmatically.
Enabling Search
Full-text support for a database is disabled by default but can be enabled at any time by setting the configuration option search.enabled to true. For example, you can create a database with full-text support as follows:
$ stardog-admin db create -o search.enabled=true -n myDb
Using Java
Similarly, you can set the option using SearchOptions.SEARCHABLE when creating the database programmatically:
// Create a disk database with full-text index
dbms.disk("waldoTest")
    .set(SearchOptions.SEARCHABLE, true)
    .create();
Integration with SPARQL
We use the predicate tag:stardog:api:property:textMatch (or http://jena.hpl.hp.com/ARQ/property#textMatch) to access the search index in a SPARQL query.
The textMatch function has one required argument, the search query in Lucene syntax. By default, it returns all literals matching the query string. For example:
SELECT DISTINCT ?s ?score
WHERE {
  ?s ?p ?l.
  (?l ?score) <tag:stardog:api:property:textMatch> 'mac'.
}
This query selects all literals that match ‘mac’. These literals are then joined with the generic BGP ?s ?p ?l to get the resources (?s) that have those literals. Alternatively, you could use ?s rdf:type ex:Book if you only wanted to select the books that match the search criteria. You can include as many other BGPs as you like to refine your initial search results.
You can change the number of results textMatch returns by providing an optional second argument with the limit:
SELECT DISTINCT ?s ?score
WHERE {
  ?s ?p ?l.
  (?l ?score) <tag:stardog:api:property:textMatch> ('mac' 100).
}
The limit in textMatch only limits the number of literals returned, which is different from the total number of results the query will return. A LIMIT specified in the SPARQL query does not affect the full-text search; it only restricts the size of the result set.
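The distinction can be seen in a toy model. The Python sketch below is not how Stardog evaluates queries; the data, the scores, and the function names are all invented for illustration. It only shows that the textMatch limit caps the matched literals before the join, while LIMIT caps the rows after it.

```python
# Toy model only: data, scores, and names are hypothetical, not Stardog internals.

# (subject, literal) pairs; one literal can be attached to several subjects.
TRIPLES = [
    ("ex:a", "mac os"),
    ("ex:b", "mac os"),
    ("ex:c", "big mac"),
    ("ex:d", "mac mini"),
]

# Made-up relevance scores for the literals that match 'mac'.
SCORES = {"mac os": 0.9, "big mac": 0.7, "mac mini": 0.4}


def text_match(limit):
    """Return at most `limit` literals, best score first (the textMatch limit)."""
    ranked = sorted(SCORES, key=SCORES.get, reverse=True)
    return ranked[:limit]


def run_query(text_limit, sparql_limit=None):
    """Join matched literals with the triples, then apply the SPARQL LIMIT."""
    matched = set(text_match(text_limit))
    rows = [(s, lit) for s, lit in TRIPLES if lit in matched]
    return rows if sparql_limit is None else rows[:sparql_limit]


print(run_query(text_limit=1))                  # one literal, two rows
print(run_query(text_limit=3, sparql_limit=2))  # three literals, rows cut to two
```

With a textMatch limit of 1, the query still yields two rows because the single matched literal belongs to two subjects.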
Lucene returns a score with each match. It is possible to return these scores and define filters based on the score:
SELECT DISTINCT ?s ?score
WHERE {
  ?s ?p ?l.
  (?l ?score) <tag:stardog:api:property:textMatch> ('mac' 0.5 10).
}
This query returns up to 10 matching literals whose score is greater than 0.5. Note that, as explained in the Lucene documentation, scoring depends heavily on the way documents are indexed, and the range of scores might change significantly between different databases.
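When queries are assembled in application code, the argument forms shown above can be generated by a small helper. The following Python sketch is an illustration, not part of any Stardog client library. The function names are hypothetical, and the helper only produces the three forms that appear on this page (query only, query plus limit, query plus threshold and limit).

```python
TEXT_MATCH = "<tag:stardog:api:property:textMatch>"


def sparql_literal(text):
    """Quote text as a single-quoted SPARQL string literal."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace("'", "\\'")
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f"'{escaped}'"


def text_match_pattern(query, threshold=None, limit=None,
                       literal_var="?l", score_var="?score"):
    """Build a textMatch triple pattern (hypothetical helper).

    A threshold without a limit is rejected because this page does not
    show that form, so the sketch does not guess at it.
    """
    if threshold is not None and limit is None:
        raise ValueError("threshold is only shown together with a limit")
    args = [sparql_literal(query)]
    if threshold is not None:
        args.append(repr(float(threshold)))  # e.g. 0.5
    if limit is not None:
        args.append(str(int(limit)))         # e.g. 10
    obj = args[0] if len(args) == 1 else "(" + " ".join(args) + ")"
    return f"({literal_var} {score_var}) {TEXT_MATCH} {obj} ."


print(text_match_pattern("mac"))
print(text_match_pattern("mac", limit=100))
print(text_match_pattern("mac", threshold=0.5, limit=10))
```

Generating the pattern from named parameters keeps the positional order in one place, which avoids mixing up the threshold and the limit at call sites.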
Service Form of Search
The textMatch predicate is concise for simple queries. However, with up to four input constants and two or more output variables, positional arguments can become confusing. An alternative syntax based on the SPARQL SERVICE clause is provided. Not only does it make the arguments clear, it also provides additional features, such as the ability to search over variable bindings and return highlighted fragments, both described below.
With the SERVICE clause syntax, we specify each parameter by name. Here’s an example using a number of different parameters:
prefix fts: <tag:stardog:api:search:>
SELECT * WHERE {
  service fts:textMatch {
      [] fts:query 'Mexico AND city' ;
         fts:threshold 0.6 ;
         fts:limit 10 ;
         fts:offset 5 ;
         fts:score ?score ;
         fts:result ?res ;
  }
}
Searching over Variable Bindings
Search queries aren’t always as simple as a single constant query. It’s possible to perform multiple search queries using other bindings in the SPARQL query as input. This can be accomplished by specifying a variable for the fts:query parameter. In the following example, we use the titles of new books to find related books:
prefix fts: <tag:stardog:api:search:>
SELECT * WHERE {
  # Find new books and their titles. Each title will be used as input to a
  # search query in the full-text index
  ?newBook a :NewBook ; :title ?title .
  service fts:textMatch {
      [] fts:query ?title ;
         fts:score ?score ;
         fts:result ?relatedText ;
  }
  # Bindings of ?relatedText will be used to look up other books in the database
  ?relatedBook :title ?relatedText .
  filter(?newBook != ?relatedBook)
}
Highlighting Relevant Fragments of Search Results
When building search engines, it’s essential not only to find the most relevant results, but also to display them in a way that helps users select the entry most relevant to them. To this end, Stardog provides a highlight argument to the SERVICE clause search syntax. When this argument is given an otherwise unbound variable, the result will include one or more fragments of the string literal returned by the search that contain the search terms. The highlightMaxPassages argument can be used to limit the number of fragments included in the highlight result.
To illustrate, an example query and results are given.
prefix fts: <tag:stardog:api:search:>
SELECT * WHERE {
  service fts:textMatch {
      [] fts:query "mexico AND city" ;
         fts:score ?score ;
         fts:result ?result ;
         fts:highlight ?highlight
  }
}
order by desc(?score)
limit 4
The results might include highlighted fragments such as:
a <b>city</b> in south central <b>Mexico</b> (southeast of <b>Mexico</b>
<b>City</b>) on the edge of central Mexican plateau
Search Syntax
Stardog search is based on Lucene 7.4.0: we support all of the search modifiers that Lucene supports, with the exception of fields.
- wildcards: ? and *
- fuzzy: ~ and ~ with similarity weights (e.g. foo~0.8)
- proximities: "semantic web"~5
- term boosting
- booleans: OR, AND, NOT, +, and -
- grouping
For a more detailed discussion, see the Lucene docs.
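The modifiers listed above can be combined into a single query string. The Python sketch below composes one with a few small helpers. The helper names are hypothetical and exist only for this example; the operators they emit (`~` for fuzzy and proximity, `^` for boosting, parentheses for grouping) follow Lucene's classic query syntax.

```python
def phrase(text, slop=None):
    """Quote a phrase; with `slop`, make it a proximity search."""
    quoted = '"' + text.replace('"', '\\"') + '"'
    return quoted if slop is None else f"{quoted}~{slop}"


def fuzzy(term, similarity=None):
    """Fuzzy match on a single term, optionally with a similarity weight."""
    return f"{term}~" if similarity is None else f"{term}~{similarity}"


def boost(clause, factor):
    """Raise the relevance of a clause by the given boost factor."""
    return f"{clause}^{factor}"


def all_of(*clauses):
    """Group clauses that must all match."""
    return "(" + " AND ".join(clauses) + ")"


def any_of(*clauses):
    """Group clauses of which at least one must match."""
    return "(" + " OR ".join(clauses) + ")"


query = all_of(
    phrase("semantic web", 5),
    any_of(fuzzy("foo", 0.8), boost("graph", 2)),
)
print(query)  # ("semantic web"~5 AND (foo~0.8 OR graph^2))
```

The resulting string is what would be passed as the search query argument of textMatch or as the value of fts:query.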
Escaping Characters in Search
The “/” character must be escaped because Lucene reserves it as the delimiter for regular expression queries. Several other characters that are part of Lucene’s query syntax must be escaped as well.
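When search terms come from user input, escaping can be done before the query is built. The Python sketch below is a minimal illustration and not a Stardog API. The function names are hypothetical, and the character set is an assumption based on the special characters listed in Lucene's query syntax documentation. Escaping is done with a backslash before each special character.

```python
# Characters treated as special here (assumption based on Lucene's query
# syntax documentation): + - & | ! ( ) { } [ ] ^ " ~ * ? : \ /
LUCENE_SPECIAL = set('\\+-!():^[]"{}~*?|&/')


def escape_lucene(text):
    """Prefix every Lucene special character with a backslash."""
    return "".join("\\" + ch if ch in LUCENE_SPECIAL else ch for ch in text)


def sparql_string(text):
    """Quote text as a single-quoted SPARQL string literal.

    Backslashes are doubled so that the Lucene escapes survive SPARQL's
    own string parsing.
    """
    return "'" + text.replace("\\", "\\\\").replace("'", "\\'") + "'"


term = escape_lucene("AC/DC")
print(term)                 # AC\/DC
print(sparql_string(term))  # 'AC\\/DC'
```

Note the two layers: the Lucene query needs `\/`, and because a backslash is itself an escape character inside a SPARQL string literal, it has to be written as `\\/` in the query text.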