Stardog Voicebox Developer Guide
This chapter provides information to developers about how to make sure Stardog Voicebox will work effectively and accurately.
Page Contents
Stardog Voicebox is available to use with your own data for Essentials and Enterprise customers.
Overview
This developer’s guide provides information about how to make sure Stardog Voicebox will work effectively and accurately.
Getting Started
In order to use Voicebox with your own data, follow these steps:
- Create your data model and connect to your data sources in Designer.
- Publish your data model into a new database with Voicebox enabled.
- Generate example questions in Explorer, Studio or the CLI.
That’s it! Once you follow these steps, you can go to the Voicebox tab in Stardog Cloud and start having a conversation.
When you are publishing your data model to a new database, make sure the Voicebox checkbox is selected as shown here:
The Voicebox option will only be available if 1) your Stardog installation has a license that allows Voicebox, and 2) you have not exceeded the number of Voicebox databases allowed in your license.
If you would like to use Voicebox with an existing database, you have to set the database option voicebox.enabled
manually. See the Database Configuration section for more details on this topic.
Once you publish your data model, example stored queries will be created automatically for you. You will be notified via email when these examples are ready. In order to generate example queries, Voicebox will inspect the instance data within the database. If you are publishing your data model in Designer with associated CSV files, no further action is needed. If you would like to import the instance data using other means, you can publish your data model again after the data is imported for the examples to be recreated again.
In addition to the auto-generated queries, it is strongly suggested to create manual example queries. See the Stored Queries for more details about adding example queries, and make sure you follow the guidelines outlined below when creating these examples.
Data Modeling
Voicebox is developed to work with data models created by Designer but you can bring your own data models created elsewhere as long as they conform to the guidelines outlined in this section. The key information used by Voicebox in data models to generate SPARQL queries are:
- Labels and descriptions of classes and properties. For schema elements only
rdfs:label
andrdfs:comment
are used for labels and comments respectively. Thelabel.properties
does not have an effect on schema. - Class hierarchy as defined by
rdfs:subClassOf
. - Property domain and range information as defined by
so:domainIncludes
orso:rangeIncludes
. - Prefixes must be unique. Two IRIs that are referenced from more than one prefix (esp. The default
_:
prefix) may confuse the LLM.
Data models should be clear and as concise as possible for Voicebox to make sense of the model and successfully generate SPARQL queries. General data modeling best practices apply for Voicebox too:
- Make data models as concise as possible. Avoid using very generic and overly complicated models such as Basic Formal Ontology (BFO) or Common Core Ontologies (CCO).
- Use clear and understandable labels for classes and properties. Avoid abbreviations or ambiguous terms in labels.
- Avoid using literals (attributes) in your data model as much as possible and instead use IRIs (relationships) in your data model. For example, the
state
property in anAddress
class should be an IRI and not a literal. IRIs allow Voicebox to link query results to graph elements, which is not possible with literals (see Stored Queries section for more discussion on this topic). - If you are using attributes, make sure you use the most appropriate type. For example, use
xsd:date
orxsd:integer
where appropriate instead ofxsd:string
. - Voicebox recognizes
rdfs:domain
andrdfs:range
, but you should avoid using these constructs in data models as much as possible. Assigning multiple domains and ranges with the RDFS vocabulary has unintended consequences for reasoning, so useso:domainIncludes
orso:rangeIncludes
instead.
Needless to say, the data model should reflect how the instance data is represented. Ideally, there are SHACL constraints associated with the data model so that the instance data can be validated with respect to these constraints. Minimally the domains and ranges of attributes and relationships should be validated.
Once a new version of a model is deployed, delete or replace the prior model version. They may not coexist, even when an old version is deactivated.
Stored Queries
Voicebox uses stored queries as few-shot examples in LLM prompts. Once the natural language questions users expect to ask are identified, SPARQL queries to answer those questions can be stored within the database using Studio, Explorer, or the CLI.
Voicebox requires a minimum of 100 queries per database to increase its accuracy for query generation. For best accuracy, 500 queries per database is recommended. Voicebox will start answering questions with the existing stored queries even if they are below these numbers, but the failure rates will be higher and the answer accuracy will be lower. The variety of the queries matter as much as the number of queries. The queries, in aggregate, should cover all the elements in the data model.
Adding queries
For a stored query to be used by Voicebox, it should satisfy the following three conditions:
- The query is public; in other words, not marked as private.
- The query is associated with a specific database.
- The query has a metadata field,
system:voiceboxQuestion
, wheresystem
is the namespace (http://system.stardog.com/
), that assigns one or more natural language questions with the query. The metadata field is automatically added by Explorer and Studio as explained below.
The database should have the option voicebox.enabled=true
set before the queries are stored. If this option is enabled after stored queries are already added, those queries should be readded. See the next section that discusses details for indexing.
Studio and Explorer provide UI components to add the natural language questions:
The Voicebox metadata field can also be set via the command-line as well:
stardog-admin stored import ListProductBrands.ttl
where ListProductBrands.ttl
would look like (with DBNAME
and USERNAME
set to appropriate values):
@prefix stardog: <tag:stardog:api:> .
@prefix system: <http://system.stardog.com/> .
system:QueryQ10000 a system:StoredQuery , system:SharedQuery ;
system:queryName "ListProductBrands" ;
system:queryDatabase "DBNAME" ;
system:queryCreator "USERNAME" ;
system:voiceboxQuestion "List all Product Brands", "Find all the product brands" ;
system:queryString """SELECT DISTINCT ?productBrand0
WHERE {
?productBrand0 a scm:Product_Brand .
}""".
Query preprocessing
When a query matching the constraints listed in the previous section is stored in Stardog, the information about this stored query is automatically added to the named graph system:VoiceboxQuestions
after several preprocessing steps.
This named graph is updated in a background thread, so if multiple stored queries are being added with stardog-admin stored import
, it might take a few seconds or minutes for all the queries to appear in this named graph. Use stardog-admin ps list
to check the background process.
The stored query information in this named graph will look like this:
system:QueryMyQuery a system:StoredQuery , system:SharedQuery ;
system:queryName "MyQuery" ;
system:voiceboxQuestion "First natural language question" ,
"Second natural language question" ;
system:queryString """ ... SPARQL Query ... """ .
Whenever a stored query is updated, the information in this named graph will be updated automatically as well. Do not update this named graph directly, because the changes might be lost! Always update the stored query, and changes will be reflected in this named graph.
The SPARQL query string will look different from the stored query string. This is due to preprocessing steps applied automatically before the query info is copied to this named graph. These preprocessing steps make all stored queries have canonical formatting, and they automatically apply some of the guidelines described in the next section.
The preprocessing steps make changes to queries that may cause them to return a different set of results than the original version. These changes are meant to make queries more generic and reusable while fixing typical user mistakes. If queries are written by advanced users following the guidelines below, it is better to disable preprocessing by setting the database configuration option voicebox.preprocessors=noop
, as explained below.
In Stardog 10, the following preprocessors will be used by default:
- ReplaceConstantIris This preprocessor will detect entity IRIs used in queries, find their labels in the data, and replace IRIs with a variable linked to the label using
stardog:label
. Type or property IRIs will not be affected. - ReplaceLabelFilters This preprocessor will detect constants or equality filters used by
rdfs:label
and replace them withstardog:label
. - RenameVariables This preprocessor will replace Explorer-generated variables names that look like
?subjN
,?subj0
, or?obj0
with meaningful names based on types and properties used in the query. - LabelPropertyFunction This preprocessor ensures the
stardog:label
andstardog:property:textMatch
usage in queries are serialized using property function syntax and not the SERVICE syntax. - OrderJoins This preprocessor reorders the triple patterns in the query so related triple patterns are grouped together.
The preprocessing can be controlled by the database configuration option voicebox.preprocessors
. The default value for this option will use all the preprocessors described above. Setting this option to noop
will disable all preprocessing steps. This option can also be set to a comma-separated list of preprocessor names to choose specific preprocessors.
Guidelines for writing queries
Follow these guidelines when creating stored queries:
- Voicebox only supports SELECT queries. Only create SELECT queries.
- Avoid duplicates in results, and use DISTINCT as much as possible.
- Always test your queries by running them to make sure they return the expected answers.
- Questions can be written as instructions, too.
- Question: “Who is the oldest person?”
- Instruction: “Find the oldest person”.
- Queries should be about instance data. Do not write queries about schema.
- Do not include redundant conditions in your queries that are not necessary to answer the question. For example, if your question is “Which spacecrafts have hyperdrive rating greater than 1?”, the query you build should not include anything about ship capacity, length, manufacturer, etc. Remove any pattern or constraint from the query that does not contribute to the result.
- The queries should always return IRIs and not labels. Voicebox retrieves the labels for IRIs before presenting the results to the user and links those labels with the IRIs. If queries don’t return IRIs, links will be missing in the results.
- Projection variables in the query should match the question. For example, if the query is asking “Find all the orders that …”, the query should have the projection “SELECT DISTINCT ?order”. If you are building the query in Explorer, pay attention to the “Results shown” selection and uncheck any type that is not asked in the question. One exception to this rule is about entities mentioned in the question. See the next bullet point for an example.
- Do not use constants in your queries for entities. This means do not use the “In” operator in Explorer queries, and do not use IRIs (other than for properties and classes) in Studio queries. Instead, the query should find the entity (typically using the label of the entity). For example, if the question is “Find movies directed by Steven Spielberg.”, the query should look like this in Studio:
SELECT DISTINCT ?movie ?director WHERE { ?movie a so:Movie . ?movie so:director ?director . ?director stardog:label "Steven Spielberg" . }
or like this in Explorer:
- Use
stardog:label
in queries instead ofrdfs:label
.stardog:label
is a new service added in Stardog 10 that finds entities matching a label (or vice versa). It takes advantage of full-text search, so it can do fuzzy matching. It can also look up multiple label properties if thelabel.properties
option is configured to do so. - If the question is asking about a count, the query should be an aggregation query returning the count. For example, if the question is asking “How many products were sold in selling point X?”, the query should not return the products, but just the count. For aggregation queries that use GROUP BY, the grouping variables may be returned. For example, the query “How many products were sold each year?” would have a
GROUP BY ?year
expression, and the variable?year
should be returned. Conversely, if the query is not asking about a count explicitly, the query should return the entities and not a count. - There should not be any extraneous ORDER BY, GROUP BY, or LIMIT in the queries if they are not required by the question. For example, if the question is asking “What are the top three …” or “Show me five …”, the LIMIT in the query is justified. Otherwise, there should not be a LIMIT, even if the query returns a very large number of results.
- Use meaningful variable names in your queries instead of generic names like
?x
,?y
, or unnamed variables like bnodes []. When queries are generated by Explorer, the auto-generated variable names like?subj0
and?obj0
are used, but Voicebox automatically renames those variables based on types and properties used in the query. - Few-shot query generation will not take into account virtualized data at this time. We are planning to include this functionality in a future release.
DOs and DON’Ts
The following table summarizes the above advice:
DOs | DON’Ts |
---|---|
Create SELECT queries | Create UPDATE queries |
Use DISTINCT as much as possible | Have duplicates in your results |
Run your queries to make sure they return the expected answers | – |
Write some of your questions as instructions | – |
Write questions about instance data | Write questions about schema |
– | Include conditions in your query that aren’t required to answer the question |
Return IRIs | Return labels |
Select the variables mentioned in your question | Select variables not mentioned in your question |
Find the entity your query is looking for | Use constants for entities |
Use stardog:label | Use rdfs:label |
Aggregate queries that ask for a count | Include other variables when you ask for the count |
– | Use ORDER BY, GROUP BY, or LIMIT if they are not required by the question |
Use meaningful variable names | Use variable names like ?x or bnodes |
Database Configuration
The database configuration option voicebox.enabled
needs to be set to true for Voicebox to work with a database. This option can be set at any time, but it is best practice to set this option at database creation time.
This option can be set to true only if the license for the Stardog server allows Voicebox to be enabled. Stardog licenses have a metadata field voicebox.count.limit
that specifies the maximum number of databases that can have Voicebox enabled at any time. You can run stardog-admin license info
to see this field for your license.
If a database is created with Stardog 9 or earlier, the default reasoning schema is set to tag:stardog:api:context:local
, which is a built-in wildcard. This is not compatible with Voicebox. The default reasoning schema should be set to one or more specific named graphs before Voicebox can be enabled. Use the command stardog reasoning schema -–list DB
to see the list of schemas and their associated graphs and the command stardog reasoning schema -–add
to update the schema graphs.
If named graph security is enabled for the database, all users should be given read access to the named graph http://system.stardog.com/VoiceboxQuestions
. If a user does not have access to this named graph, they will not be able to see the example queries, and Voicebox will not be able to answer their questions accurately. See Query Preprocessing for more details about this named graph.
Setting the option voicebox.enabled
will automatically trigger various other database options to be configured as well. No further administrator action is required for the additional options to be set. Information about these additional options are provided below for completeness.
search.enabled=true
- This option enables full-text search for the database. It is required for Voicebox to find entities referenced in user questions, among other tasks. This search uses Lucene’s lexical matching capabilities.
search.index.contexts.filter=tag:stardog:api:context:local
search.index.contexts.excluded=false
- The above two options together enable full-text search over all the named graphs in the database.
search.semantic.enabled=true
search.semantic.index.contexts.filter=http://system.stardog.com/VoiceboxQuestions, [all graphs of the default schema]
- This option enables semantic search over specific named graphs in the database. Semantic search is powered by vector search and requires each text literal to be turned into vectors for indexing. For this reason, it is only enabled over the special named graph that contains stored queries (see the Stored Queries section below) plus reasoning schema graphs.
search.semantic.model=””
- This option specifies which language model should be used by semantic search to turn text values into vectors. The default is an empty string, which means the default language model included in the Stardog distribution should be used. This default model is all-MiniLM-L6-v2. However, any other language model from HuggingFace can be used as an alternative.
- A valid value looks like this:
search.semantic.model="djl://ai.djl.huggingface.pytorch/sentence-transformers/paraphrase-albert-small-v2"
. You can see more about supported protocols and address variants here.
- A valid value looks like this:
- This option specifies which language model should be used by semantic search to turn text values into vectors. The default is an empty string, which means the default language model included in the Stardog distribution should be used. This default model is all-MiniLM-L6-v2. However, any other language model from HuggingFace can be used as an alternative.
search.index.properties.excluded=http://system.stardog.com/queryString
- This option excludes SPARQL query strings from the vector index. SPARQL queries are typically long strings and costly to vectorize, and they don’t need to be included in the vector search.
search.index.compute.norm=true
- This option specifies that text lengths (norms) should be computed and indexed. This means text length will influence scores for full-text search and yield more accurate results.
label.properties=http://www.w3.org/2000/01/rdf-schema#label
- This option specifies which properties are used to associate labels with entities. When Voicebox is looking for an entity mentioned in a user question, it will only consider the options specified here to find the matching entities.
reasoning.schema.versioning.enabled=true
- This option enables internal change tracking functionality for reasoning schema graph. This allows Voicebox to detect that the schema for the database has been updated without inspecting the contents of the schema graphs.
- Because of schema versioning, Voicebox will automatically reflect changes to the default schema graphs, but it does not support serving from a historical schema snapshot, since that functionality is not covered by schema versioning.
- This option enables internal change tracking functionality for reasoning schema graph. This allows Voicebox to detect that the schema for the database has been updated without inspecting the contents of the schema graphs.
reasoning.precompute.non_empty.predicates=false
- This option disables a reasoning optimization that is not compatible with schema versioning. This optimization is most useful when database contents (schema or data) are mostly static, so disabling it does not have a negative impact on typical usage.
Common pitfalls
Below are some common issues users run into when using Stardog Voicebox:
- Creating no or too few example queries.
- Issues with namespaces
- Editing or removing the default namespace for a database.
- Using a prefix in a query that is not defined in the database’s namespace.
- Using a prefix in a query that is mapped to a different IRI in the database’s namespace.
- Using multiple prefixes for the same namespace. e.g., both
schema
andso
for<https://schema.org/>
.
- Issues with stored queries
- Training Voicebox on queries that are not stored queries. (Resolution: reload all stored queries)
- Deleting stored queries. (Resolution: reload all stored queries)
- Failing to process stored queries. (Resolution: process new queries after they’re stored)
- Using an example query that’s too long.
- Using classes or properties in training queries that are not in an ontology.
A diagnostics tool will soon be available that will check for these issues.