Link Search Menu Expand Document
Start for Free

Managing Query Performance

This page describes various ways to manage and analyze query performance in Stardog.

Page Contents
  1. Overview
  2. Query Plan Syntax
  3. Stardog Evaluation Model
  4. Pipelining And Pipeline Breakers
  5. Skipping Intermediate Results
  6. Diagnosing Performance Problems
  7. Query Plan Operators
  8. Query Hints
  9. Further Reading

Overview

Stardog answers queries in two major phases: determining the query plan and executing that plan to obtain answers from the data. The former is called query planning (or query optimization) and includes all steps required to select the most efficient way to execute the query. How Stardog evaluates a query can only be understood by analyzing the query plan. Query plan analysis is also the main tool for investigating performance issues as well as addressing them, in particular, by re-formulating the query to make it more amenable to optimization.

Query Plan Syntax

We will use the following running example to explain query plans in Stardog.

SELECT DISTINCT ?person ?name
WHERE {
  ?article rdf:type bench:Article .
  ?article dc:creator ?person .
  ?inproc rdf:type bench:Inproceedings .
  ?inproc dc:creator ?person .
  ?person foaf:name ?name
}

This query returns the names of all people who have authored both a journal article and a paper in a conference proceedings. The query plan used by Stardog (in this example, 4.2.2) to evaluate this query is:

Distinct [#812K]
`─ Projection(?person, ?name) [#812K]
  `─ MergeJoin(?person) [#812K]
      +─ MergeJoin(?person) [#391K]
      │  +─ Sort(?person) [#391K]
      │  │  `─ MergeJoin(?article) [#391K]
      │  │     +─ Scan[POSC](?article, rdf:type, bench:Article) [#208K]
      │  │     `─ Scan[PSOC](?article, dc:creator, ?person) [#898K]
      │  `─ Scan[PSOC](?person, foaf:name, ?name) [#433K]
      `─ Sort(?person) [#503K]
        `─ MergeJoin(?inproc) [#503K]
            +─ Scan[POSC](?inproc, rdf:type, bench:Inproceedings) [#255K]
            `─ Scan[PSOC](?inproc, dc:creator, ?person) [#898K]

The plan is arranged in an hierarchical, tree-like structure. The nodes, called operators, represent units of data processing during evaluation. They correspond to evaluations of graphs patterns or solution modifiers as defined in SPARQL 1.1 specification. All operators can be regarded as functions which may take some data as input and produce some data as output. All input and output data is represented as streams of solutions, that is, sets of bindings of the form x -> value where x is a variable used in the query and value is some RDF term (IRI, blank node, or literal). Examples of operators include scans, joins, filters, unions, etc.

Numbers in square brackets after each node refer to the estimated cardinality of the node, i.e. how many solutions Stardog expects this operator to produce when the query is evaluated. Statistics-based cardinality estimation in Stardog merits a separate blog post, but here are the key points for the purpose of reading query plans:

  1. All estimations are approximate and their accuracy can vary greatly (generally: more precise for bottom nodes, less precise for upper nodes)
  2. Estimations are only used for selecting the best plan but have no bearing on the actual results of the query
  3. In most cases a sub-optimal plan can be explained by inaccurate estimations

Stardog Evaluation Model

Stardog generally evaluates query plans according to the bottom-up SPARQL semantics. Leaf nodes are evaluated first and without input, and their results are then sent to their parent nodes up the plan. Typical examples of leaf nodes include scans, i.e. evaluations of triple patterns, evaluations of full-text search predicates, and VALUES operators. They contain all information required to produce output, for example, a triple pattern can be directly evaluated against Stardog indexes. Parent nodes, such as joins, unions, or filters, take solutions as inputs and send their results further towards the root of the tree. The root node in the plan, which is typically one of the solution modifiers, produces the final results of the query which are then encoded and sent to the client.

Pipelining And Pipeline Breakers

Stardog implements the Volcano model, in which evaluation is as lazy as possible. Each operator does just enough work to produce the next solution. This is important for performance, especially for queries with a LIMIT clause (of which ASK queries are a special case) and also enables Stardog’s query engine to send the first result(s) as soon as they are available (as opposed to waiting till all results have been computed).

Not all operators can produce output solutions as soon as they get first input solutions from their children nodes. Some need to accumulate intermediate results before sending output. Such operators are called pipeline breakers, and they are often the culprits for performance problems, typically resulting from memory pressure. It is important to be able to spot them in the plan since they can suggest either a way to re-formulate the query to help the planner or a way to make the query more precise by specifying extra constants where they matter.

Here are some important pipeline breakers in the example plan:

  • HashJoin algorithms build a hash table for solutions produced by the right operand. Typically all such solutions need to be hashed, either in memory or spilled to disk, before the first output solution is produced by the HashJoin operator.
  • Sort: the sort operator builds an intermediate sorted collection of solutions produced by its child node. The main use case for sorting solutions is to prepare data for an operator which can benefit from sorted inputs, such as MergeJoin, Distinct, or GroupBy. All solutions have to be fetched from the child node before the smallest (w.r.t. the sort key) solution can be emitted.
  • GroupBy: group-by operators are used for aggregation, e.g. counting or summing results. When evaluating a query like select ?x (count(?y) as ?count) where { ... } group by ?x Stardog has to scroll through all solutions to compute the count for every ?x key before returning the first result.

Other operators can produce output as soon as they get input:

  • MergeJoin: merge join algorithms do a single zig-zag pass over sorted streams of solutions produced by children nodes and output a solution as soon as the join condition is satisfied.
  • DirectHashJoin: contrary to the classical hash join algorithm, this operator does not build a hash table. It utilizes Stardog indexes for look-ups which doesn’t require extra data structures. This is only possible when the right operand is sorted by the join key, but the left isn’t, otherwise Stardog would use a merge join.
  • Filter: a solution modifier which evaluates the filter condition on each input solution.
  • Union: combines streams of children solutions without any extra work, e.g. joining, so there’s no need for intermediate results.

Now, returning to the above query, one can see Sort pipeline breakers in the plan:

Sort(?person) [#391K]
`─ MergeJoin(?article) [#391K]
   +─ Scan[POSC](?article, rdf:type, bench:Article) [#208K]
   `─ Scan[PSOC](?article, dc:creator, ?person) [#898K]

This means that all solutions representing the join of ?article rdf:type bench:Article and ?article dc:creator ?person will be put in a sequence ordered by the values of ?person. Stardog expects to sort 391K solutions before they can be further merge-joined with the results of the ?person foaf:name ?name pattern. Alternately the engine may build a hash table instead of sorting solutions; such decisions are made by the optimizer based on a number of factors.

Skipping Intermediate Results

One tricky part of understanding Stardog query plans is that evaluation of each operator in the plan is context-sensitive, i.e. it depends on what other nodes are in the same plan, maybe in a different sub-tree. In particular, the cardinality estimations, even if assumed accurate, only specify how many solutions the operator is expected to produce when evaluated as the root node of a plan.

However, as it is joined with other parts of the plan, the results can be different. This is because Stardog employs optimizations to reduce the number of solutions produced by a node by pruning those which are incompatible with other solutions with which they will later be joined.

Consider the following basic graph pattern and the corresponding plan:

?erdoes rdf:type foaf:Person .
?erdoes foaf:name "Paul Erdoes"^^xsd:string .
?document dc:creator ?erdoes .
MergeJoin(?erdoes) [#10]
+─ MergeJoin(?erdoes) [#1]
│  +─ Scan[POSC](?erdoes, rdf:type, foaf:Person) [#433K]
│  `─ Scan[POSC](?erdoes, foaf:name, "Paul Erdoes") [#1]
`─ Scan[POSC](?document, dc:creator, ?erdoes) [#898K]

The pattern matches all documents created by a person named Paul Erdoes. Here the second pattern is selective (only one entity is expected to have the name “Paul Erdoes”). This information is propagated to the other two scans in the plan via merge joins, which allows them to skip scanning large parts of data indexes.

In other words, the node Scan[POSC](?erdoes, rdf:type, foaf:Person) [#433K] will not produce all 433K solutions corresponding to all people in the database and, similarly, Scan[POSC](?document, dc:creator, ?erdoes) [#898K] will not go through all 898K document creators.

Diagnosing Performance Problems

Performance problems may arise because of two reasons:

  1. complexity of the query itself, especially the amount of returned data
  2. failure to select a good plan for the query.

It is important to distinguish the two. In the former case the best way forward is to make the patterns in WHERE more selective. In the latter case, i.e. when the query returns some modest number of results but takes an unacceptably long time to do so, one needs to look at the plan, identify the bottlenecks (most often, pipeline breakers), and reformulate the query or report it to us for further analysis.

Here’s an example of a un-selective query:

SELECT DISTINCT ?name1 ?name2
WHERE {
  ?article1 rdf:type bench:Article .
  ?article2 rdf:type bench:Article .
  ?article1 dc:creator ?author1 .
  ?author1 foaf:name ?name1 .
  ?article2 dc:creator ?author2 .
  ?author2 foaf:name ?name2 .
  ?article1 swrc:journal ?journal .
  ?article2 swrc:journal ?journal
  FILTER (?name1<?name2)
  }

The query returns all distinct pairs of authors who published (possibly different) articles in the same journal. It returns more than 18M results from a database of 5M triples. Here’s the plan:

Distinct [#17.7M]
`─ Projection(?name1, ?name2) [#17.7M]
   `─ Filter(?name1 < ?name2) [#17.7M]
      `─ HashJoin(?journal) [#35.4M]
         +─ MergeJoin(?author2) [#391K]
         │  +─ Sort(?author2) [#391K]
         │  │  `─ NaryJoin(?article2) [#391K]
         │  │     +─ Scan[POSC](?article2, rdf:type, bench:Article) [#208K]
         │  │     +─ Scan[PSOC](?article2, swrc:journal, ?journal) [#208K]
         │  │     `─ Scan[PSOC](?article2, dc:creator, ?author2) [#898K]
         │  `─ Scan[PSOC](?author2, foaf:name, ?name2) [#433K]
         `─ MergeJoin(?author1) [#391K]
            +─ Sort(?author1) [#391K]
            │  `─ NaryJoin(?article1) [#391K]
            │     +─ Scan[POSC](?article1, rdf:type, bench:Article) [#208K]
            │     +─ Scan[PSOC](?article1, swrc:journal, ?journal) [#208K]
            │     `─ Scan[PSOC](?article1, dc:creator, ?author1) [#898K]
            `─ Scan[PSOC](?author1, foaf:name, ?name1) [#433K]

This query requires an expensive join on ?journal which is evident from the plan (it’s a hash join in this case). It produces more than 18M results (Stardog expects 17.7M which is pretty accurate here) that need to be filtered and examined for duplicates. Given all this information from the plan, the only reasonable way to address the problem would be to restrict the criteria, e.g. to particular journals, people, time periods, etc.

If a query is well-formulated and selective, but performance is unsatisfactory, one may look closer at the pipeline breakers, e.g. this part of the query plan:

MergeJoin(?person) [#391K]
+─ Sort(?person) [#391K]
|  `─ MergeJoin(?article) [#391K]
|     +─ Scan[POSC](?article, rdf:type, bench:Article) [#208K]
|     `─ Scan[PSOC](?article, dc:creator, ?person) [#898K]
`─ Scan[PSOC](?person, foaf:name, ?name) [#433K]

A reasonable thing to do would be to evaluate the join of ?article rdf:type bench:Article and ?article dc:creator ?person separately, i.e. as a separate queries, to see if the estimation of 391K is reasonably accurate and to get an idea about memory pressure. This is a valuable piece of information for a performance problem report, especially when the data cannot be shared with us. Similar analysis can be done for hash joins.

In addition to pipeline breakers, there could be other clear indicators of performance problems. One of them is the presence of LoopJoin nodes in the plan. Stardog implements the nested loop join algorithm which evaluates the join by going through the Cartesian product of its inputs. This is the slowest join algorithm and it is used only as a last resort. It sometimes, but not always, indicates a problem with the query.

Here’s an example:

SELECT DISTINCT ?person ?name
WHERE {
  ?article rdf:type bench:Article .
  ?article dc:creator ?person .
  ?inproc rdf:type bench:Inproceedings .
  ?inproc dc:creator ?person2 .
  ?person foaf:name ?name .
  ?person2 foaf:name ?name2
  FILTER (?name=?name2)
}

The query is similar to an earlier query plan we saw but runs much slower. The plan shows why:

Distinct [#98456.0M]
`─ Projection(?person, ?name) [#98456.0M]
   `─ Filter(?name = ?name2) [#98456.0M]
      `─ LoopJoin(_) [#196912.1M]
         +─ MergeJoin(?person) [#391K]
         │  +─ Sort(?person) [#391K]
         │  │  `─ MergeJoin(?article) [#391K]
         │  │     +─ Scan[POSC](?article, rdf:type, bench:Article) [#208K]
         │  │     `─ Scan[PSOC](?article, dc:creator, ?person) [#898K]
         │  `─ Scan[PSOC](?person, foaf:name, ?name) [#433K]
         `─ MergeJoin(?person2) [#503K]
            +─ Sort(?person2) [#503K]
            │  `─ MergeJoin(?inproc) [#503K]
            │     +─ Scan[POSC](?inproc, rdf:type, bench:Inproceedings) [#255K]
            │     `─ Scan[PSOC](?inproc, dc:creator, ?person2) [#898K]
            `─ Scan[PSOC](?person2, foaf:name, ?name2) [#433K]

The loop join near the top of the plan computes the Cartesian product of the arguments which produces almost 200B solutions. This is because there is no shared variable between the parts of the query which correspond to authors of articles and conference proceedings papers, respectively. The filter condition ?name = ?name2 cannot be transformed into an equi-join because the semantics of term equality used in filters is different from the solution compatibility semantics used for checking join conditions.

The difference manifests itself in the presence of numerical literals, e.g. "1"^^xsd:integer = "1.0"^^xsd:float, where they are different RDF terms. However, as long as all names in the data are strings, one can re-formulate this query by renaming ?name2 to ?name which would enable Stardog to use a more efficient join algorithm.

Query Plan Operators

The following operators are used in Stardog query plans:

Operator Description
Scan[Index] evaluates a triple/quad pattern against Stardog indexes. Indicates the index used, e.g. CSPO or POSC, where S,P,O,C stand for the kind of lexicographic ordering of quads that the index provides. SPOC means that the index is sorted first by Subject, then Predicate, Object, and Context (named graph IRI)
HashJoin(join key) hash join algorithm, hashes the right operand. Pipeline breaker.
BindJoin(join key) a join algorithm binds the join variables on the right operator to the current values of the same variables in the current solution on the left. Can be seen as an optimization of the nested loop join for the case when the left operator produces far fewer results than the right. Not a pipeline breaker.
DirectHashJoin(join key) a hash join algorithm which directly uses indexes for lookups instead of building a hash table. Not a pipeline breaker.
MergeJoin(join key) merge join algorithm, the fastest option for joining two streams of solutions. Requires both operands be sorted on the join key. Not a pipeline breaker.
BindJoin this join algorithm binds (thus the name) join key variables of the right operator to the current values of the same variables of the left operator and re-evaluates the right. Not a pipeline breaker.
NaryJoin(join key) same as MergeJoin but for N operators sorted on the same join key.
NestedLoopJoin the nested loop join algorithm, the slowest join option. The only join option when there is no join key. Not a pipeline breaker.
Shortest|All(Cyclic)Paths Path operators.
Sort(sort key) sorts the argument solutions by the sort key, typically used as a part of a merge join. Pipeline breaker.
Filter(condition) filters argument solutions according to the condition. Not a pipeline breaker.
Union combines streams of argument solutions. If both streams are sorted by the same variable, the result is also sorted by that variable. Not a pipeline breaker.
Minus Removes solutions from the left operand that are compatible with solutions from the right operand. Pipeline breaker.
PropertyPath evaluates a property path pattern against Stardog indexes. Not a pipeline breaker.
GroupBy groups results of the child operator by values of the group-by expressions (i.e. keys) and aggregates solutions for each key. Pipeline breaker (unless the input is sorted by first key).
Distinct removes duplicate solutions from the input. Not a pipeline breaker but accumulates solutions as it runs so the memory pressure increases with the number of unique solutions.
VALUES produces the inlined results specified in the query. Not a pipeline breaker.
Search evaluates a full-text search predicates against the Lucene index within a Stardog database.
Projection projects variables as results of a query or a sub-query. Not a pipeline breaker.
Bind evaluates expressions on each argument solution and binds their values to (new) variables. Not a pipeline breaker.
Unnest unnest array expressions. See the UNNEST operator. Not a pipeline breaker.
Empty and Singleton correspond to the empty solution set and a single empty solution, respectively.
Type reasoning operator for evaluating patterns of the form ?x rdf:type ?type or :instance rdf:type ?type. Not a pipeline breaker.
Property operator for evaluating triple patterns with unbound predicate with reasoning. Not a pipeline breaker but could be very expensive especially for large schemas. Better be avoided by either using an IRI in the predicate position or turning off reasoning for such patterns using a hint.
Service SPARQL federation operator which evaluate a pattern against a remote SPARQL endpoint or a virtual graph.
ServiceJoin(join key) a join algorithm used when one of the operators is a Service (see above). Propagates bindings from the the operator to reduce the number of results coming over the network.
Slice(offset=<>, limit=<>) combines LIMIT and OFFSET solution modifiers in SPARQL.
OrderBy an operator which implements the ORDER BY solution modifier in SPARQL.
Describe a SPARQL Describe operator.
ADD, CLEAR, COPY, LOAD, MOVE, DELETE, DELETE DATA, INSERT, INSERT DATA SPARQL Update operators.

Query Hints

Query hints help Stardog generate optimized query plans. They can be expressed in queries in two ways:

  1. as SPARQL comments started with the pragma keyword, e.g. #pragma push.filters aggressive, or
  2. as SPARQL triple patterns of the form [] <tag:stardog:api:hint:{hint name}> {hint value}, e.g. [] hint:push.filters "aggressive"

The first approach makes hints transparent to other query processing tools (other than Stardog). The second approach is preferred when using 3rd party tools which do not preserve SPARQL comments. In both cases a hint applies to the scope where it’s used and all nested scopes (unless overridden by the same hint with a different value).

The equality.identity hint expects a comma-separated list of variables. It tells Stardog that these variables will be bound to RDF terms (IRIs, bnodes, or literals) for which equality coincides with identity (i.e. any term is equal only to itself). This is not true for literals of certain numerical datatypes (cf. Operator Mapping). However assuming that the listed variables do not take on values of such datatypes can sometimes lead to faster query plans, for example, because of converting some filters to joins and through value inlining.

 SELECT ?o ?o2 WHERE {
   #pragma equality.identity ?o,?o2
   ?s   :p  ?o ;
        :q  ?o2
   FILTER (?o = ?o2)
 }

Sometimes our query planner can produce sub-optimal join orderings. The group.joins hint introduces an explicit scoping mechanism to help with join order optimization. Patterns in the scope of the hint, given by the enclosing {}, will be joined together before being joined with anything else. This way, you can tell the query planner what you think is the optimal way to join variables.

select ?s where {
  ?s :p ?o1 .
  {
    #pragma group.joins
    #these patterns will be joined first, before being joined with the other pattern
    ?s :p ?o2 .
    ?o1 :p ?o3 .
  }
}

The push.filters hint controls how the query optimizer pushes filters down the query plan. There are three possible values: default, aggressive, and off. The aggressive option means that the optimizer will push every filter to the deepest operator in the plan which binds variables used in the filter expression. The off option turn the optimization off and each filter will be applied to the top operator in the filter’s graph pattern (in case there’re multiple filters, their order is not specified). Finally, the default option (or absence of the hint) means that the optimizer will decide whether to push each filter down the plan based on various factors, e.g. the filter’s cost, selectivity of the graph pattern, etc.

select ?s where {
  #pragma push.filters off
  #the filter in the top scope will not be pushed into the union
  ?s :p ?o1 .
  FILTER (?o2 > 10)
  {
    #pragma push.filters aggressive
    #the optimizer will place this filter directly on top of ?s :r ?o3
    #and it will be evaluated before the results are joined with ?s :p ?o2
    ?s :p ?o2 ;
       :r ?o3 .
    FILTER (?o3 > 1000)
  }
  UNION
  {
    #pragma push.filters default
    #the optimizer will decide whether to place the filter directly
    #on top of ?s :q ?o3 or leave it on top of the join
    ?s a :Type ;
       :q ?o3 .
    FILTER (?o3 < 50)
  }
}

Further Reading

For more information on optimizing and writing queries in Stardog, checkout out some of our more detailed blogs: