Fixing Avoidably Slow Queries
In this tutorial we will focus on how to improve the performance of slow queries.
Page Contents
Unselective Queries
First, let’s consider unselective queries, that is, those which return large result sets and are often user errors. In the worst case an unselective query is indistinguishable from a bulk data transfer over a highly inefficient mechanism.
Here’s a simplified version of one such query that we saw recently:
CONSTRUCT {
?bus a :Bus .
?tram a :Tram .
}
WHERE {
?bus rdfs:label ?busLabel.
FILTER (CONTAINS(LCASE(?busLabel), "bus"))
?tram rdfs:label ?tramLabel.
FILTER (CONTAINS(LCASE(?tramLabel), "tram"))
}
With this query the user wants to identify certain objects by their labels and create proper type triples (e.g. for buses and trams). This is a common pattern: pick some feature as the basis of an explicit type assertion—this node is part of the category or class of nodes called “foo” because it has feature “bar”).
The problem is that this query will cause Stardog to compute and return all pairs of buses and trams instead of a simple union of buses and trams. The former explodes and there’s nothing that the optimizer can do about it. It’s rarely what the user intends. This is visible from the query plan with the top-level loop join:
Projection(?bus AS ?subject, ?ltnuylet AS ?predicate, ?immqbtmb AS ?object;
+─ ?tram AS ?subject, ?ltnuylet AS ?predicate, ?phfyiimx AS ?object)
`─ Bind(<urn:Tram> AS ?phfyiimx) (<urn:Bus> AS ?immqbtmb) (rdf:type AS ?ltnuylet)
`─ LoopJoin(_)
+─ Filter(CONTAINS(LCASE(?tram), "tram"))
│ `─ Scan[PSOC](?tram, <http://www.w3.org/2000/01/rdf-schema#label>, ?busLabel)
`─ Filter(CONTAINS(LCASE(?bus), "bus"))
`─ Scan[PSOC](?bus, <http://www.w3.org/2000/01/rdf-schema#label>, ?tramLabel)
The better way to write the intended query is like this:
CONSTRUCT {
?bus a :Bus .
?tram a :Tram .
}
WHERE {
{ ?bus rdfs:label ?busLabel.
FILTER (CONTAINS(LCASE(?bus), "bus"))
}
UNION
{
?tram rdfs:label ?tramLabel.
FILTER (CONTAINS(LCASE(?tram), "tram"))
}
}
Even experienced users can occasionally overlook such mistakes. The first thing to consider when diagnosing a slow query is to double check that the query is going to return the results you want it to return (and, ideally, nothing else).
Type Promotion
The semantics of literal equality are such that efficient evaluation of certain types of filters isn’t possible. Here’s a simplified query from the SP2B benchmark:
SELECT DISTINCT ?person ?name
WHERE {
?article rdf:type bench:Article .
?article dc:creator ?person .
?inproc rdf:type bench:Inproceedings .
?inproc dc:creator ?person2 .
FILTER (?person=?person2)
}
This may look like an innocuous join on ?person=?person2
but it can return a superset of results of this query:
SELECT DISTINCT ?person ?name
WHERE {
?article rdf:type bench:Article .
?article dc:creator ?person .
?inproc rdf:type bench:Inproceedings .
?inproc dc:creator ?person .
}
And it’s harder to evaluate since even literals of different datatypes, e.g. "10.0"^^xsd:float
and "10.0"^^xsd:decimal
, can be equal.
This is called type promotion. Maintaining special literal indexes to represent the equality classes for data literals is an unnecessary overhead. Such queries are rare.
In the absence of such indexes, however, the filter has to be applied to all pairs of ?person
and ?person2
. This is again visible from the query plan with a loop join. Stardog 5 provides a query hint assume.iri
which tells the optimizer that the given variable, e.g. ?person
will be bound to IRIs only so the optimizer can remove the filter and evaluate a join instead.
Hard Analytic Queries
The more interesting problems occur with naturally hard queries which process a lot of data, which is typical for analytic queries. Those often do not include a selective constant—as opposed to look-up queries (e.g. “give me names of all of John’s friends”)—but rather perform complex computations over large portions of the graph.
This query asks for the top 10 most discussed product categories of products from a specific country based on number of reviews by reviewers from another country.
select ?productType ?reviewCount
{
{ select ?productType (count(?review) As ?reviewCount)
{
?productType a bsbm:ProductType .
?product a ?productType .
?product bsbm:producer ?producer .
?producer bsbm:country country:china .
?review bsbm:reviewFor ?product .
?review rev:reviewer ?reviewer .
?reviewer bsbm:country country:us .
}
group by ?productType
}
}
order by desc(?reviewCount) ?productType
limit 10
Neither country may be a selective restriction so the query engine may churn through a lot of data to find the required product categories. It is this sort of query for which the optimal plan—specifically, the order of joins—is critical. When performance is unsatisfactory, it may be possible to reformulate the query such that it’s easier for the optimizer to find the best plan.