Semantic Search
This page discusses Stardog’s semantic search capabilities.
Overview
Stardog’s semantic search capability resembles full-text search in the sense that it provides a way to find similar natural-language texts. However, it differs from full-text search in several key aspects.
At the heart of semantic search is the notion of word embeddings: words or phrases are mapped into a multi-dimensional vector space, and the proximity of vectors in this space defines the similarity of texts.
Moreover, this process is powered by language models that aim to capture the semantic meaning of texts in addition to their grammatical structure. These models can be generic or domain-specific, so the search can be customized for domain-specific use cases.
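To make "proximity of vectors" concrete, here is a minimal sketch using cosine similarity, one common proximity measure. The source does not say which measure is used here, and the three-dimensional vectors below are made up for illustration; a real language model produces vectors with many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", not produced by any real model.
embeddings = {
    "city": [0.9, 0.1, 0.0],
    "town": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}


def rank(query, vectors):
    """Return the other texts ordered from most to least similar to `query`."""
    q = vectors[query]
    others = [
        (text, cosine_similarity(q, v))
        for text, v in vectors.items()
        if text != query
    ]
    return sorted(others, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for text, score in rank("city", embeddings):
        print(f"{text}: {score:.3f}")
```

With these toy vectors, "town" ranks above "banana" for the query "city" because its vector points in nearly the same direction.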
From an integration standpoint, semantic search relies on an external service. Stardog does not manage the lifecycle of that service: it must be started separately and be reachable from Stardog. Stardog automatically updates the contents of the vector database as the contents of the Stardog database are updated. If the external service is unavailable or unreachable from Stardog, transactions will fail.
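Because an unreachable service causes transactions to fail, it can be useful to check connectivity before starting a large update. The following is a minimal sketch of such a check, not part of Stardog; the helper name `is_reachable` is hypothetical, and it only verifies that a TCP connection can be opened, not that the service behaves correctly.

```python
import socket
from urllib.parse import urlparse


def is_reachable(endpoint, timeout=2.0):
    """Return True if a TCP connection to the endpoint's host and port succeeds."""
    parsed = urlparse(endpoint)
    # Fall back to the scheme's default port when the URL does not name one.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # The endpoint value mirrors the examples on this page.
    print(is_reachable("http://localhost:8000"))
```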
Currently an implementation based on txtai is supported in Beta state. Integration with other vector databases and semantic search systems will be provided in the future based on input and feedback from Stardog users.
Service Setup
The txtai service runs as an external process and is not managed by Stardog.
The simplest way to run it is as a web application within the Uvicorn web server.
Installation
pip3 install duckdb 'txtai[api]'
Running
CONFIG=/path/to/config.yml uvicorn txtai.api:app
By default, the server listens on localhost:8000. This can be overridden:
CONFIG=/path/to/config.yml uvicorn --host 127.0.0.1 --port 9000 txtai.api:app
Example config.yml file:
# Index file path
path: /tmp/index
# Allow indexing of documents
writable: True
# Embeddings index
embeddings:
  path: sentence-transformers/all-MiniLM-L6-v2
See the Configuration section of the txtai documentation for more details.
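Once the service is running, it can be queried directly to confirm that it responds, independently of Stardog. The sketch below assumes the txtai API exposes a GET `/search` endpoint taking `query` and `limit` parameters, as described in the txtai API documentation; verify this against the version you have installed. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def search_url(base, query, limit=3):
    """Build the URL for a search request (endpoint and parameters assumed)."""
    params = urlencode({"query": query, "limit": limit})
    return f"{base.rstrip('/')}/search?{params}"


def search(base, query, limit=3, timeout=5.0):
    """Send the request and return the parsed JSON response."""
    with urlopen(search_url(base, query, limit), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Requires a running service; the address matches the default shown above.
    print(search("http://localhost:8000", "city"))
```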
Enabling Semantic Search
From CLI
$ stardog-admin db create -o semantic.search.enabled=true -o semantic.search.api.endpoint=http://localhost:8000 -n myDb
Using Java
Semantic search can be enabled when creating a database programmatically by setting the semantic.search.enabled and semantic.search.api.endpoint options, for example:
dbms.newDatabase("embeddingsTest")
.set(EmbeddingsOptions.EMBEDDINGS_SEARCHABLE, true)
.set(EmbeddingsOptions.EMBEDDINGS_API_ENDPOINT, "http://localhost:8000")
.create();
Integration with SPARQL
Unlike full-text search, semantic search supports only the service form. Example:
prefix fts: <tag:stardog:api:search:>
SELECT * WHERE {
service fts:semanticMatch {
[] fts:query 'city' ;
fts:threshold 0.4 ;
fts:result ?result ;
fts:score ?score ;
fts:limit 10 ;
}
}
The semantic search service is identified by the tag:stardog:api:search:semanticMatch URI and takes the following parameters:
Parameter Name | Description |
---|---|
query | string to search for in the semantic search index |
result | results returned from the search index for the query |
score | score calculated between the query and a hit |
threshold | minimum score; only results with a score greater than or equal to this value are included |
limit | maximum number of hits to return |
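The interaction of `threshold` and `limit` can be sketched as a filter followed by a cut-off. This is an illustration of the semantics described in the table, not Stardog's implementation; in particular, returning hits in descending score order is an assumption, and the sample hits are made up.

```python
def apply_threshold_and_limit(hits, threshold=None, limit=None):
    """Keep hits scoring at or above `threshold`, then return at most `limit` of them.

    `hits` is a list of (result, score) pairs. Ordering by descending score
    is an assumption made for this sketch.
    """
    kept = [(r, s) for r, s in hits if threshold is None or s >= threshold]
    kept.sort(key=lambda hit: hit[1], reverse=True)
    return kept if limit is None else kept[:limit]


if __name__ == "__main__":
    # Hypothetical hits for the query 'city'.
    hits = [("Paris", 0.62), ("banana", 0.11), ("town", 0.4), ("village", 0.39)]
    print(apply_threshold_and_limit(hits, threshold=0.4, limit=10))
```

Note that a hit scoring exactly 0.4 is kept when the threshold is 0.4, matching the "above or equal" wording in the table.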
Searching over Variable Bindings
As with full-text match, the fts:query parameter can be specified as a variable so that the input to semantic search comes from other graph patterns in the query:
prefix fts: <tag:stardog:api:search:>
SELECT * WHERE {
# descriptions of places, bound to ?description variable,
# will be used as inputs to semantic search
?place a :Place; :description ?description .
service fts:semanticMatch {
[] fts:query ?description ;
fts:score ?score ;
fts:result ?similarDescription ;
}
}
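Semantic search queries are ordinary SPARQL queries, so they can be sent over HTTP like any other query. The sketch below builds such a request with the Python standard library. The server address `http://localhost:5820`, the `/{database}/query` path, and the `admin`/`admin` credentials are assumptions based on common Stardog defaults and do not come from this page; adjust them for your installation.

```python
import base64
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# The first query from this section, unchanged.
QUERY = """
prefix fts: <tag:stardog:api:search:>
SELECT * WHERE {
  service fts:semanticMatch {
    [] fts:query 'city' ;
       fts:threshold 0.4 ;
       fts:result ?result ;
       fts:score ?score ;
       fts:limit 10 ;
  }
}
"""


def build_request(server, database, query, user, password):
    """Build a SPARQL-protocol POST request with HTTP basic authentication."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return Request(
        f"{server.rstrip('/')}/{database}/query",
        data=urlencode({"query": query}).encode(),
        headers={
            "Accept": "application/sparql-results+json",
            "Content-Type": "application/x-www-form-urlencoded",
            "Authorization": f"Basic {token}",
        },
    )


def run_query(server, database, query, user, password):
    """Execute the query and return the list of result bindings."""
    with urlopen(build_request(server, database, query, user, password)) as resp:
        return json.load(resp)["results"]["bindings"]


if __name__ == "__main__":
    # Assumed defaults; requires a running Stardog server and txtai service.
    for row in run_query("http://localhost:5820", "myDb", QUERY, "admin", "admin"):
        print(row["result"]["value"], row["score"]["value"])
```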
Customization of Indexing
Data Types to be Indexed
Semantic search uses the same option as full-text search to control which datatypes are indexed: search.index.datatypes.
From CLI
$ stardog-admin db create -o semantic.search.enabled=true -o semantic.search.api.endpoint=http://localhost:8000 -o search.index.datatypes=urn:String,urn:Date -n myDb
Using Java
// Create a database with semantic search index with specific data types
List<IRI> dataTypeList = Lists.newArrayList(
Datatype.STRING.iri(),
Datatype.DATE.iri(),
Datatype.DATETIME.iri());
dbms.newDatabase("embeddingsTest")
.set(EmbeddingsOptions.EMBEDDINGS_SEARCHABLE, true)
.set(EmbeddingsOptions.EMBEDDINGS_API_ENDPOINT, "http://localhost:8000")
.set(SearchOptions.INDEX_DATATYPES, dataTypeList)
.create();
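When databases are created from scripts, the `-o key=value` options shown in the CLI examples can be assembled from a dictionary. This is a minimal sketch with a hypothetical helper, `create_db_command`; it only builds the argument list for the `stardog-admin db create` invocation shown above and uses no options beyond those on this page.

```python
import shlex


def create_db_command(name, options):
    """Build the argument list for `stardog-admin db create` from a dict of options."""
    args = ["stardog-admin", "db", "create"]
    for key, value in options.items():
        if isinstance(value, bool):
            # The CLI examples use lowercase true/false.
            value = str(value).lower()
        elif isinstance(value, (list, tuple)):
            # Multi-valued options are comma-separated in the CLI examples.
            value = ",".join(value)
        args += ["-o", f"{key}={value}"]
    return args + ["-n", name]


if __name__ == "__main__":
    cmd = create_db_command(
        "myDb",
        {
            "semantic.search.enabled": True,
            "semantic.search.api.endpoint": "http://localhost:8000",
            "search.index.datatypes": ["urn:String", "urn:Date"],
        },
    )
    # Print the command; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```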
Transactional Updates
Since semantic search integrates with an external system, that system needs to be kept up to date with Stardog’s internal index.
By default, when semantic search is enabled, literals added during a transaction are also indexed in the semantic search system. This can be disabled by setting semantic.search.reindex.tx to false. Doing so causes semantic search to return incomplete results until the index is rebuilt explicitly.
From CLI
stardog-admin db create -o semantic.search.enabled=true -o semantic.search.api.endpoint=http://localhost:8000 -o semantic.search.reindex.tx=false -n myDb
Using Java
dbms.newDatabase("embeddingsTest")
.set(EmbeddingsOptions.EMBEDDINGS_SEARCHABLE, true)
.set(EmbeddingsOptions.EMBEDDINGS_API_ENDPOINT, "http://localhost:8000")
.set(EmbeddingsOptions.EMBEDDINGS_REINDEX_TX, false)
.create();
To rebuild the semantic search index, an optimize command must be issued explicitly:
From CLI
stardog-admin db optimize myDb
Using Java
dbms.optimize("embeddingsTest");
Force-Rebuilding of Semantic Search Index
The optimize.search option forces the optimize command to rebuild the semantic search index regardless of the state of the database.
From CLI
stardog-admin metadata set -o optimize.search=true -- myDb
stardog-admin db optimize myDb
Using Java
dbms.optimize("embeddingsTest", Metadata.of(SearchOptions.OPTIMIZE, true));
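For scheduled maintenance, the rebuild steps can be wrapped in a small script. The sketch below uses hypothetical helpers, `rebuild_commands` and `rebuild`; it runs exactly the CLI commands shown above, in the same order, and assumes `stardog-admin` is on the PATH.

```python
import subprocess


def rebuild_commands(database, force=False):
    """Return the CLI invocations that rebuild the semantic search index."""
    commands = []
    if force:
        # Set optimize.search first so the following optimize always rebuilds.
        commands.append(
            ["stardog-admin", "metadata", "set", "-o", "optimize.search=true", "--", database]
        )
    commands.append(["stardog-admin", "db", "optimize", database])
    return commands


def rebuild(database, force=False):
    """Run the commands in order, stopping at the first failure."""
    for argv in rebuild_commands(database, force):
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Print the commands rather than running them; call rebuild() to execute.
    for argv in rebuild_commands("myDb", force=True):
        print(" ".join(argv))
```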