Managing Query Performance
This page describes various ways to manage and analyze query performance in Stardog.
Page Contents
Overview
Stardog answers queries in two major phases: determining the query plan and executing that plan to obtain answers from the data. The former is called query planning (or query optimization) and includes all steps required to select the most efficient way to execute the query. How Stardog evaluates a query can only be understood by analyzing the query plan. Query plan analysis is also the main tool for investigating performance issues, as well as addressing them (in particular, by re-formulating the query to make it more amenable to optimization).
Query Plan Syntax
We will use the following running example to explain query plans in Stardog.
SELECT DISTINCT ?person ?name
WHERE {
?article rdf:type bench:Article .
?article dc:creator ?person .
?inproc rdf:type bench:Inproceedings .
?inproc dc:creator ?person .
?person foaf:name ?name
}
This query returns the names of all people who have authored both a journal article and a paper in a conference proceedings. The query plan used by Stardog to evaluate this query is:
Distinct [#812K]
`─ Projection(?person, ?name) [#812K]
`─ MergeJoin(?person) [#812K]
+─ MergeJoin(?person) [#391K]
│ +─ Sort(?person) [#391K]
│ │ `─ MergeJoin(?article) [#391K]
│ │ +─ Scan[POSC](?article, rdf:type, bench:Article) [#208K]
│ │ `─ Scan[PSOC](?article, dc:creator, ?person) [#898K]
│ `─ Scan[PSOC](?person, foaf:name, ?name) [#433K]
`─ Sort(?person) [#503K]
`─ MergeJoin(?inproc) [#503K]
+─ Scan[POSC](?inproc, rdf:type, bench:Inproceedings) [#255K]
`─ Scan[PSOC](?inproc, dc:creator, ?person) [#898K]
The plan is arranged in a hierarchical, tree-like structure which grows left to right. The nodes, called operators, represent units of data processing during evaluation. They correspond to evaluations of graph patterns or solution modifiers as defined in SPARQL 1.1 specification. All operators can be regarded as functions which may take some data as input and produce some data as output. All input and output data is represented as streams of solutions, that is, sets of bindings of the form x -> value
where x
is a variable used in the query and value
is some RDF term (IRI, blank node, or literal). Examples of operators include scans, joins, filters, unions, etc.
Numbers in square brackets after each node refer to the estimated cardinality of the node, i.e. how many solutions Stardog expects this operator to produce when the query is evaluated. Statistics-based cardinality estimation in Stardog merits a separate blog post, but here are the key points for the purpose of reading query plans:
- All estimations are approximate, and their accuracy can vary greatly (generally: more precise for bottom nodes, less precise for upper nodes).
- Estimations are only used for selecting the best plan but have no bearing on the actual results of the query.
- In most cases, a sub-optimal plan can be explained by inaccurate estimations.
Stardog Evaluation Model
Stardog generally evaluates query plans according to the bottom-up SPARQL semantics. Leaf nodes are evaluated first and without input, and their results are then sent to their parent nodes up the plan. Typical examples of leaf nodes include scans, i.e. evaluations of triple patterns, evaluations of full-text search predicates, and VALUES
operators. They contain all information required to produce output, for example, a triple pattern can be directly evaluated against Stardog indexes. Parent nodes, such as joins, unions, or filters, take solutions as inputs and send their results further towards the root of the tree. The root node in the plan, which is typically one of the solution modifiers, produces the final results of the query which are then encoded and sent to the client.
Pipelining And Pipeline Breakers
Stardog implements the Volcano model, in which evaluation is as lazy as possible. Each operator does just enough work to produce the next solution. This is important for performance, especially for queries with a LIMIT
clause (of which ASK
queries are a special case) and also enables Stardog’s query engine to send the first result(s) as soon as they are available (as opposed to waiting till all results have been computed).
Not all operators can produce output solutions as soon as they get first input solutions from their children nodes. Some need to accumulate intermediate results before sending output. Such operators are called pipeline breakers, and they are often the culprits for performance problems, typically resulting from memory pressure. It is important to be able to spot them in the plan since they can suggest either: a) a way to re-formulate the query to help the planner, or b) a way to make the query more precise by specifying extra constants where they matter.
Here are some important pipeline breakers in the example plan:
HashJoin
: hash join algorithms build a hash table for solutions produced by the right operand. Typically all such solutions need to be hashed, either in memory or spilled to disk, before the first output solution is produced by theHashJoin
operator.Sort
: the sort operator builds an intermediate sorted collection of solutions produced by its child node. The main use case for sorting solutions is to prepare data for an operator which can benefit from sorted inputs, such asMergeJoin
,Distinct
, orGroupBy
. All solutions have to be fetched from the child node before the smallest (w.r.t. the sort key) solution can be emitted.GroupBy
: group-by operators are used for aggregation, e.g. counting or summing results. When evaluating a query likeselect ?x (count(?y) as ?count) where { ... } group by ?x
Stardog has to scroll through all solutions to compute the count for every?x
key before returning the first result.
Other operators can produce output as soon as they get input:
MergeJoin
: merge join algorithms do a single zig-zag pass over sorted streams of solutions produced by children nodes and output a solution as soon as the join condition is satisfied.DirectHashJoin
: contrary to the classical hash join algorithm, this operator does not build a hash table. It utilizes Stardog indexes for look-ups which doesn’t require extra data structures. This is only possible when the right operand is sorted by the join key, but the left isn’t, otherwise Stardog would use a merge join.Filter
: a solution modifier which evaluates the filter condition on each input solution.Union
: combines streams of children solutions without any extra work, e.g. joining, so there’s no need for intermediate results.
Now, returning to the above query, one can see Sort
pipeline breakers in the plan:
Sort(?person) [#391K]
`─ MergeJoin(?article) [#391K]
+─ Scan[POSC](?article, rdf:type, bench:Article) [#208K]
`─ Scan[PSOC](?article, dc:creator, ?person) [#898K]
This means that all solutions representing the join of ?article rdf:type bench:Article
and ?article dc:creator ?person
will be put in a sequence ordered by the values of ?person
. Stardog expects to sort 391K
solutions before they can be further merge-joined with the results of the ?person foaf:name ?name
pattern. Alternately, the engine may build a hash table instead of sorting solutions; such decisions are made by the optimizer based on a number of factors.
Skipping Intermediate Results
One tricky part of understanding Stardog query plans is that evaluation of each operator in the plan is context-sensitive, i.e. it depends on what other nodes are in the same plan, maybe in a different sub-tree. In particular, the cardinality estimations, even if assumed accurate, only specify how many solutions the operator is expected to produce when evaluated as the root node of a plan.
However, as it is joined with other parts of the plan, the results can be different. This is because Stardog employs optimizations to reduce the number of solutions produced by a node by pruning those which are incompatible with other solutions with which they will later be joined.
Consider the following basic graph pattern and the corresponding plan:
?erdoes rdf:type foaf:Person .
?erdoes foaf:name "Paul Erdoes"^^xsd:string .
?document dc:creator ?erdoes .
MergeJoin(?erdoes) [#10]
+─ MergeJoin(?erdoes) [#1]
│ +─ Scan[POSC](?erdoes, rdf:type, foaf:Person) [#433K]
│ `─ Scan[POSC](?erdoes, foaf:name, "Paul Erdoes") [#1]
`─ Scan[POSC](?document, dc:creator, ?erdoes) [#898K]
The pattern matches all documents created by a person named Paul Erdoes. Here the second pattern is selective (only one entity is expected to have the name “Paul Erdoes”). This information is propagated to the other two scans in the plan via merge joins, which allows them to skip scanning large parts of data indexes.
In other words, the node Scan[POSC](?erdoes, rdf:type, foaf:Person) [#433K]
will not produce all 433K
solutions corresponding to all people in the database and, similarly, Scan[POSC](?document, dc:creator, ?erdoes) [#898K]
will not go through all 898K
document creators.
SPARQL Profiler
Stardog 7.7.1+ provides a query profiler to help diagnose performance problems and identify computational bottlenecks in the query plan. Performance problems usually arise for four reasons:
- complexity of the query itself, especially the amount of returned data.
- failure to select a good plan for the query.
- server overload, e.g. lack of memory or CPU resources due to many concurrent queries or transactions.
- environment-related problems, e.g. slow disks or lack of resources due to other applications running on the same hardware.
When a query runs slow, the first important thing is to place the issue into one of the above categories. This section is concerned with reasons 1 and 2. The query plan, particularly augmented with profiling data, can help distinguish between them. The first kind of problems is usually fixed by making the query more selective while the second typically requires query hints or reformulations without changing the meaning of the query, i.e. the results.
Here’s an example of a non-selective query:
SELECT DISTINCT ?name1 ?name2
WHERE {
?article1 rdf:type bench:Article .
?article2 rdf:type bench:Article .
?article1 dc:creator ?author1 .
?author1 foaf:name ?name1 .
?article2 dc:creator ?author2 .
?author2 foaf:name ?name2 .
?article1 swrc:journal ?journal .
?article2 swrc:journal ?journal
FILTER (?name1<?name2)
}
The query returns all distinct pairs of authors who published (possibly different) articles in the same journal. It returns more than 18M results from a database of 5M triples. Using the stardog query explain --profile {db} {query}
CLI command, one can get the following plan with profiling information (prefixes omitted for brevity):
Profiling results:
Query executed in 57934 ms and returned 18362955 result(s)
Total used memory: 1.6G
Pre-execution time: 182 ms (0.3%)
Post-processing time: 7270 ms (12.5%)
prefix : <http://api.stardog.com/>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix owl: <http://www.w3.org/2002/07/owl#>
prefix stardog: <tag:stardog:api:>
From local
From named local named
Distinct [#17.7M], memory: {total=1.5G (96.1%)}, results: 18.4M, wall time: 15985 ms (27.6%)
`─ Projection(?name1, ?name2) [#17.7M], results: 18.5M, wall time: 2459 ms (4.2%)
`─ Filter(?name1 < ?name2) [#17.7M], results: 18.5M, wall time: 16092 ms (27.8%)
`─ HashJoin(?journal) [#35.4M], memory: {total=22M (1.3%)}, results: 37.4M, wall time: 10467 ms (18.1%)
+─ MergeJoin(?author2) [#390K], results: 391K, wall time: 259 ms (0.4%)
│ +─ Scan[PSOC](?author2, <http://xmlns.com/foaf/0.1/name>, ?name2) [#433K], results: 385K (with gaps), wall time: 233 ms (0.4%)
│ `─ Sort(?author2) [#390K], memory: {total=21M (1.3%)}, results: 391K (with gaps), wall time: 388 ms (0.7%)
│ `─ NaryJoin(?article2) [#390K], results: 391K, wall time: 193 ms (0.3%)
│ +─ Scan[PSOC](?article2, <http://purl.org/dc/elements/1.1/creator>, ?author2) [#898K], results: 395K (with gaps), wall time: 154 ms (0.3%)
│ +─ Scan[PSOC](?article2, <http://swrc.ontoware.org/ontology#journal>, ?journal) [#208K], results: 208K (with gaps), wall time: 84 ms (0.1%)
│ `─ Scan[POSC](?article2, rdf:type, <http://localhost/vocabulary/bench/Article>) [#208K], results: 208K (with gaps), wall time: 97 ms (0.2%)
`─ MergeJoin(?author1) [#390K], results: 391K, wall time: 202 ms (0.3%)
+─ Scan[PSOC](?author1, <http://xmlns.com/foaf/0.1/name>, ?name1) [#433K], results: 385K (with gaps), wall time: 200 ms (0.3%)
`─ Sort(?author1) [#390K], memory: {total=21M (1.3%)}, results: 391K (with gaps), wall time: 447 ms (0.8%)
`─ NaryJoin(?article1) [#390K], results: 391K, wall time: 249 ms (0.4%)
+─ Scan[PSOC](?article1, <http://purl.org/dc/elements/1.1/creator>, ?author1) [#898K], results: 395K (with gaps), wall time: 221 ms (0.4%)
+─ Scan[POSC](?article1, rdf:type, <http://localhost/vocabulary/bench/Article>) [#208K], results: 208K (with gaps), wall time: 110 ms (0.2%)
`─ Scan[PSOC](?article1, <http://swrc.ontoware.org/ontology#journal>, ?journal) [#208K], results: 208K (with gaps), wall time: 128 ms (0.2%)
The profiler reports the following:
- How long the query was executed on the server side and how many results it generated. The reported time is the wall-clock (not CPU) time and it does not include the time to send results to the client (because the profiler only returns the plan). If the query fails to complete, for example, due to a timeout, the first line of the profiler’s output will contain information about the error (more details can be found in the server’s log).
- Total amount of memory used by the server (1.6G in this case). Note: this only accounts for memory managed by Stardog and not objects on the JVM heap. External tools and GC-related JVM arguments can be used to estimate that.
- Pre-execution time: this is time spent on selecting the query plan. If reasoning is enabled, it will also include the time to rewrite the query according to the schema.
- Post-processing time: Stardog uses dictionary encoding for RDF terms in the data to speed-up query processing, such as joins. Post-processing accounts for the overhead to convert the final query results back to IRIs, bnodes, and literals before shipping to the client.
- the number of intermediate query results generated by each operator in the plan. The profiler detects when some part of the operator’s output is skipped over and adds “with gaps” next to the number of results.
- Wall time for each operator in the plan (in ms). It does not include time spent in the child operators; those are reported separately.
- Total amount of allocated managed memory used by an operator.
- That is reported for pipeline-breaking operators which materialize intermediate results in memory. If an operator has to spill some of the results to disk, the amount of disk IO is also reported in the same line.
- In case the operator spilled, the total amount of allocated memory is reported, along with the maximum used memory at any time. Operators spilling to disk may return some memory to the global pool. This avoids that concurrent queries operate without any memory at all.
The above plan reveals several important points:
- The query returns a lot of results (
18M
) which by itself is a performance problem (encoding them in an HTTP response, transmitting, and parsing on the client side can be expected to take noticeable amount of time, too). - The most expensive operators are
Filter(?name1 < ?name2)
,Distinct
, andHashJoin(?journal)
. Cumulatively they account for nearly75%
of the execution time. This is not surprising given the number of results. - Actually reading data from disk in
Scan
operators takes relatively little time, i.e. this is a CPU/memory-bound query, not a disk IO-bound query so faster disks won’t help. - The cardinality estimations (the numbers in square brackets) are pretty accurate for operators in the plan, i.e. close to the number of results actually generated. This suggests that it is likely the best plan for the query.
A common cause of suboptimal query plans is cardinality mis-estimations, i.e. when the number of results that the optimizer thought an operator would generate and the actual number of results are several orders of magnitude apart.
Given all this information from the plan, the only reasonable way to address the problem would be to restrict the criteria, e.g. to particular journals, people, time periods, etc., or add LIMIT
to the bottom of the query.
Current Limitations
The profiler has the following limitations, some of which may be lifted in the future:
- Profiling is only supported in CLI via the
query explain --profile
command or directly via HTTP. - Profiling is only supported for read queries (with the exception of path queries).
DELETE/INSERT...WHERE
queries can be profiled by re-writing them asCONSTRUCT
queries with the same graph template andWHERE
pattern. - Detailed profiler data is not collected for a small number of operators, particularly, bind joins and filters with
EXISTS
expressions. The profiler will report execution time for such operators but not for their child operators (in case of bind join) or operators insideEXISTS
. The reason is that such operators can dynamically change their execution plan based on external bindings. That is also true for path queries.
Live Query Profiling
Using the query explain CLI command with the --profile
option, one starts a query with additional instrumentation for profiling. Via the same command it is also possible to inquire about the current live state of the query via the query ID.
The following two steps are necessary to do so:
- Using
stardog-admin query list
to get the list of the currently running queries. - Provide the ID to the explain command to get the current query plan including live profiling information:
stardog query explain -- {db} {query_id}
.
Diagnosing Performance Problems Without The Profiler
If the profiler is not available (e.g., in older versions of Stardog), one can still examine the query plan to figure out the issues. For example, the vanilla query plan for the above query suggests that HashJoin(?journal)
generates a very high number of results. Thus one can deduce that the filtering and de-duplication in Distinct
is going to be expensive. It is then clear that the query needs either more selective conditions or a LIMIT
to get faster.
If a query is well-formulated and selective but performance is unsatisfactory, one may look closer at the pipeline breakers, e.g. this part of the query plan:
MergeJoin(?person) [#391K]
+─ Sort(?person) [#391K]
| `─ MergeJoin(?article) [#391K]
| +─ Scan[POSC](?article, rdf:type, bench:Article) [#208K]
| `─ Scan[PSOC](?article, dc:creator, ?person) [#898K]
`─ Scan[PSOC](?person, foaf:name, ?name) [#433K]
A reasonable thing to do would be to evaluate the join of ?article rdf:type bench:Article
and ?article dc:creator ?person
separately, i.e. as a plan query, to see if the estimation of 391K
is reasonably accurate. That would provide an estimate for how long it takes or for the memory footprint. This is a valuable piece of information for a support ticket, especially when the data cannot be shared with us. Similar analysis can be done for hash joins.
If the query plan indicates that there are expensive operators, one can use query hints to change the query plan. Particularly useful is the #pragma group.joins hint which allows changing the join order so that expensive pipeline-breaking joins (e.g. hash joins) would happen later (that is, higher) in the execution tree before more selective operators reduce the number of intermediate results. One can also disable particular join algorithms using the #pragma join.{type} off
hints.
In addition to pipeline breakers, there could be other clear indicators of performance problems. One of them is the presence of NestedLoopJoin
nodes in the plan. Stardog implements the nested loop join algorithm which evaluates the join by going through the Cartesian product of its inputs. This is the slowest join algorithm, and it is used only as a last resort. It sometimes, but not always, indicates a problem with the query.
Here’s an example:
SELECT DISTINCT ?person ?name
WHERE {
?article rdf:type bench:Article .
?article dc:creator ?person .
?inproc rdf:type bench:Inproceedings .
?inproc dc:creator ?person2 .
?person foaf:name ?name .
?person2 foaf:name ?name2
FILTER (?name=?name2)
}
The query is similar to an earlier query plan we saw but runs much slower. The plan shows why:
Distinct [#98456.0M]
`─ Projection(?person, ?name) [#98456.0M]
`─ Filter(?name = ?name2) [#98456.0M]
`─ NestedLoopJoin(_) [#196912.1M]
+─ MergeJoin(?person) [#391K]
│ +─ Sort(?person) [#391K]
│ │ `─ MergeJoin(?article) [#391K]
│ │ +─ Scan[POSC](?article, rdf:type, bench:Article) [#208K]
│ │ `─ Scan[PSOC](?article, dc:creator, ?person) [#898K]
│ `─ Scan[PSOC](?person, foaf:name, ?name) [#433K]
`─ MergeJoin(?person2) [#503K]
+─ Sort(?person2) [#503K]
│ `─ MergeJoin(?inproc) [#503K]
│ +─ Scan[POSC](?inproc, rdf:type, bench:Inproceedings) [#255K]
│ `─ Scan[PSOC](?inproc, dc:creator, ?person2) [#898K]
`─ Scan[PSOC](?person2, foaf:name, ?name2) [#433K]
The loop join near the top of the plan computes the Cartesian product of the arguments which produces almost 200B solutions. This is because there is no shared variable between the parts of the query which correspond to authors of articles and conference proceedings papers, respectively. The filter condition ?name = ?name2
cannot be transformed into an equi-join because the semantics of term equality used in filters is different from the solution compatibility semantics used for checking join conditions.
The difference manifests itself in the presence of numerical literals, e.g. "1"^^xsd:integer
= "1.0"^^xsd:float
, where they are different RDF terms. However, as long as all names in the data are strings, one can re-formulate this query by renaming ?name2
to ?name
, which would enable Stardog to use a more efficient join algorithm.
Query Plan Operators
The following operators are used in Stardog query plans. Recall that if an operator has arguments (operators which generate input results), they are rendered left to right, i.e. the left operand is rendered first and the right is rendered second (under it with the same indentation).
Operator | Description |
---|---|
Scan[Index] | evaluates a triple/quad pattern against Stardog indexes. Indicates the index used, e.g. CSPO or POSC , where S,P,O,C stand for the kind of lexicographic ordering of quads that the index provides. SPOC means that the index is sorted first by Subject, then Predicate, Object, and Context (named graph IRI) |
HashJoin(join key) | hash join algorithm, builds a hash table for the right operand, linearly scans the left operand. Pipeline breaker. |
BindJoin(join key) | bind join algorithm, linearly scans the left operand and for each result replaces join key variables in the right operator by their values in the current left result. Can be seen as an optimization of the nested loop join for the case when the left operator produces far fewer results than the right. Not a pipeline breaker. |
BlockBindJoin(join key) | an optimized version of the bind join that evaluates the right operand for a block of join key values from the left operand (rather than a single row of join key values at a time). Not a pipeline breaker. |
DirectHashJoin(join key) | a hash join algorithm which directly uses indexes for lookups instead of building a hash table. Not a pipeline breaker. |
MergeJoin(join key) | merge join algorithm, the fastest option for joining two streams of solutions. Requires both operands be sorted on the join key. Not a pipeline breaker. |
NaryJoin(join key) | same as MergeJoin but for N operators sorted on the same join key. |
NestedLoopJoin | the nested loop join algorithm, the slowest join option. The only join option when there is no join key. Not a pipeline breaker. |
Shortest|All(Cyclic)Paths | Path operators. |
Sort(sort key) | sorts the argument solutions by the sort key, typically used as a part of a merge join. Pipeline breaker. |
Filter(condition) | filters argument solutions according to the condition. Not a pipeline breaker. |
Union | combines streams of argument solutions. If both streams are sorted by the same variable, the result is also sorted by that variable. Not a pipeline breaker. |
Minus | Removes solutions from the left operand that are compatible with solutions from the right operand. Pipeline breaker. |
PropertyPath | evaluates a property path pattern against Stardog indexes. Not a pipeline breaker. |
GroupBy | groups results of the child operator by values of the group-by expressions (i.e. keys) and aggregates solutions for each key. Pipeline breaker (unless the input is sorted by first key). |
Distinct | removes duplicate solutions from the input. Not a pipeline breaker, but accumulates solutions as it runs so the memory pressure increases with the number of unique solutions. |
VALUES | produces the inlined results specified in the query. Not a pipeline breaker. |
Search | evaluates a full-text search predicates against the Lucene index within a Stardog database. |
Projection | projects variables as results of a query or a sub-query. Not a pipeline breaker. |
Bind | evaluates expressions on each argument solution and binds their values to (new) variables. Not a pipeline breaker. |
Unnest | unnest array expressions. See the UNNEST operator. Not a pipeline breaker. |
Empty and Singleton | correspond to the empty solution set and a single empty solution, respectively. |
Type | reasoning operator for evaluating patterns of the form ?x rdf:type ?type or :instance rdf:type ?type . Not a pipeline breaker. |
Property | operator for evaluating triple patterns with unbound predicate with reasoning. Not a pipeline breaker, but could be very expensive especially for large schemas. Better be avoided by either using an IRI in the predicate position or turning off reasoning for such patterns using a hint. |
Service | SPARQL federation operator which evaluate a pattern against a remote SPARQL endpoint or a virtual graph. |
ServiceJoin(join key) | a join algorithm used when one of the operands is a Service (see above). Propagates bindings from the left operand to the Service to reduce the number of results coming over the network. |
MultiWayServiceJoin(join key) | Same as ServiceJoin but the join is evaluated over the union of multiple virtual graph Service operands. This can reduce the number of requests to Virtual Graphs. |
Slice(offset=<>, limit=<>) | combines LIMIT and OFFSET solution modifiers in SPARQL. |
OrderBy | an operator which implements the ORDER BY solution modifier in SPARQL. |
Describe | a SPARQL Describe operator. |
ADD , CLEAR , COPY , LOAD , MOVE , DELETE , DELETE DATA , INSERT , INSERT DATA | SPARQL Update operators. |
Query Hints
Query hints are not a part of the SPARQL Query Language. It is a mechanism to communicate an extra information to Stardog’s query engine, usually the query optimizer, which could affect the query execution process. The prime use case is improving query plans for performance reasons.
Query Hint Syntax
Query hints can be expressed in queries in two ways:
-
as SPARQL comments started with the
pragma
keyword:#pragma <hint name> <hint value>
There is no space between
#
andpragma
. -
As SPARQL triple patterns of the form:
[] <tag:stardog:api:hint:{hint name}> {hint value}
The first approach makes hints transparent to other query processing tools (other than Stardog). The second approach is preferred when using 3rd party tools which do not preserve SPARQL comments.
Most hints apply to the scope of the query (i.e. the area within {}
) in which they are defined and all nested scopes. Some hints apply to the entire query in which case they are defined in the preamble, i.e. before the top-level query keyword (like SELECT
or CONSTRUCT
). They start with the #pragma
keyword followed by the hint name and then the value. Each hint must be placed on its own line. Values are case-insensitive.
Scope Hint Example
SELECT * {
{ #pragma group.joins
?person a :Person ;
:friend ?other .
}
?person :name ?name
}
Preamble Hint Example
#pragma describe.strategy CBD
DESCRIBE ?person {
?person :lives :LittleRock
}
Hints In Query Plans
Starting with version 7.9, Stardog keeps query hints in query plan format. Hints are represented as tree nodes like all other operators, and use the following syntax:
#pragma <hint_name>=<hint_value>
Or just:
#pragma <hint_name>
if the hint does not require any value.
One hint node can contain multiple hint definitions:
#pragma <hint_name1>=<hint_value1> <hint_name2>=<hint_value2> ...
If a hint is applied to a scope, the sub-plan representing that scope becomes a child of this hint:
Projection(?o, ?o2)
`─ #pragma equality.identity=?o,?o2 join.choice.strategy=Streaming
`─ Filter(?o = ?o2)
`─ NestedLoopJoin(_)
+─ Scan[SPO](?s, :p, ?o)
`─ Scan[SPO](?s, :q, ?o2)
For preamble hints, the whole plan becomes the child of a hint node:
#pragma describe.strategy=CBD
`─ Describe-CBD
`─ Projection(?person)
`─ Scan[SPO](?person, :lives, :LittleRock)
describe.strategy=CBD
Plans with query hints can also be executed, like any other plan queries.
General Query Optimization Hints
These hints are suggestions to Stardog’s query optimizer. They can impact the query plan selected by the optimizer, but they should not change query results.
equality.identity
The equality.identity
hint expects a comma-separated list of variables. It tells Stardog that these variables will be bound to RDF terms (IRIs, bnodes, or literals) for which equality coincides with identity (i.e. any term is equal only to itself). This is not true for literals of certain numerical datatypes (cf. Operator Mapping). However, assuming that the listed variables do not take on values of such datatypes can sometimes lead to faster query plans (e.g., because of converting some filters to joins and through value inlining).
SELECT ?o ?o2 WHERE {
#pragma equality.identity ?o,?o2
?s :p ?o ;
:q ?o2
FILTER (?o = ?o2)
}
group.joins
Sometimes Stardog’s query planner can produce sub-optimal join orderings. The group.joins
hint introduces an explicit scoping mechanism to help with join order optimization. Patterns in the scope of the hint, given by the enclosing {}
, will be joined together before being joined with anything else. This way, you can tell the query planner what you think is the optimal way to join variables.
SELECT ?s WHERE {
?s :p ?o1 .
{
#pragma group.joins
#these patterns will be joined first, before being joined with the other pattern
?s :p ?o2 .
?o1 :p ?o3 .
}
}
push.filters
The push.filters
hint controls how the query optimizer pushes filters down the query plan. There are three possible values: default
, aggressive
, and off
.
- The
aggressive
option means that the optimizer will push every filter to the deepest operator in the plan, which binds variables used in the filter expression. - The
off
option turn the optimization off, and each filter will be applied to the top operator in the filter’s graph pattern (in case there are multiple filters, their order is not specified). - The
default
option (or absence of the hint) means that the optimizer will decide whether to push each filter down the plan based on various factors, e.g. the filter’s cost, selectivity of the graph pattern, etc.
SELECT ?s WHERE {
#pragma push.filters off
#the filter in the top scope will not be pushed into the union
?s :p ?o1 .
FILTER (?o2 > 10)
{
#pragma push.filters aggressive
#the optimizer will place this filter directly on top of ?s :r ?o3
#and it will be evaluated before the results are joined with ?s :p ?o2
?s :p ?o2 ;
:r ?o3 .
FILTER (?o3 > 1000)
}
UNION
{
#pragma push.filters default
#the optimizer will decide whether to place the filter directly
#on top of ?s :q ?o3 or leave it on top of the join
?s a :Type ;
:q ?o3 .
FILTER (?o3 < 50)
}
}
cardinality
Another useful hint is cardinality
. Estimations of cardinality, i.e. how many results a part of the query is expected to match, is critical for the query optimizer. Usually sub-optimal query plans are due to poor cardinality estimations, which happen more often as queries get more complex. The cardinality hint enables the user to suggest the cardinality for the pattern in {}
, thus helping the optimizer to decide how to join it with the rest of the query:
SELECT * {
{ #pragma cardinality 100000
?person a :Person ;
:friend ?other .
}
?person :name ?name
}
If you know that a pattern is selective, but don’t know how selective, you could substitute a value of low
instead of an integer for the query hint value. Conversely, if you know that a pattern is not selective, but don’t know how unselective, you could provide a value of high
instead of an integer for the query hint value.
Reasoning Hints
All types of queries (that is, SELECT
, ASK
, CONSTRUCT
, PATHS
, DESCRIBE
), and updates can be evaluated with reasoning. When reasoning is enabled, it applies to all query patterns in WHERE
and VIA
blocks. It is possible to selectively disable it for certain parts of the query using the reasoning
hint as follows:
SELECT * WHERE {
?person rdf:type :Employee .
{ #pragma reasoning off
?person ?p ?o
}
}
This query uses reasoning to select all employees thus retrieves managers, etc. but returns only asserted properties for each of them. Disabling reasoning for ?s ?p ?o
patterns is often handy since those may cause performance problems while not providing particularly useful inferences. In complex queries, it is possible to re-enable reasoning for a nested graph scope with #pragma reasoning on
. The hint is ignored when the query is evaluated without reasoning.
Statistics Configuration
Named Graph Characteristics
In Stardog, statistics are computed and managed against all local graphs in a database. In case the active graph of patterns in the query are a subset of those graphs, the cardinality estimations for those patterns can be scaled based on the size of the corresponding active graph. The assumption is that all graphs in the database contain structurally similar data and it is feasible to scale estimation based on the active graph size and the size of the entire database.
For example, consider a scenario where data about employees and buildings of a company should be stored. The data about employees and buildings is managed by departments. The data could be stored and queried in the following two ways.
- Homogeneous named graphs: The data is stored in a single database with a named graph per department that contains employee and building data. In this case, when the active graph of a pattern is a single named graph in the database, it is feasible to scale the cardinality estimation based on the size of that named graph because all named graphs contain structurally similar data (e.g., the same characteristic sets).
- Heterogeneous named graphs: The data is stored in one database per department with a named graph for the employee data and a named graph for the building data. If the active graph of a pattern is a single named graph in the database, it is not feasible to scale the cardinality estimation based on the size of that named graph. That is because the named graphs are not likely to share structurally similar data (e.g., they do not share the same characteristic sets) and therefore, scaling cardinality estimations can lead to underestimations.
By default, the first case of homogeneous named graphs is assumed. In the presence of heterogeneous named graphs, the database option index.statistics.enable_active_graph
can be set to false
to improve the accuracy of cardinality estimations and thus query performance.
Scope Evaluation Hints
There are situations when one part of the query is quite complex but very selective. With #pragma evaluate on
hint the optimizer can materialize such part - replace it with results it produces (logically represented in query plan as the VALUES operator). This can decrease overall query exectution time however there are some trade-offs:
- materialization itself takes time
- optimizer can’t use cached plans with
evaluate
hint; the query needs to be optimized every time it runs
In addition to evaluate
hint, evaluate.limit
hint can be used to specify upper bound on the number of results returned from materizlized scope. This hint takes an integer value (by default, the value of com.stardog.optimizer.evaluate.max_results
property used as an upper bound).
SELECT * WHERE {
?person :livesIn ?city .
{
#pragma evaluate on
#pragma evaluate.limit 10
?person foaf:knows/:ssn "123-12-1234"
}
}
Further Reading
For more information on optimizing and writing queries in Stardog, checkout out some of our more detailed blogs: