Stored Query Service
This page discusses the Stored Query Service, which enables users to run stored queries as subqueries.
Overview
Stardog supports invoking stored queries, including Path Queries, in the context of another SPARQL query using the SERVICE keyword. The Stored Query Service (SQS) was released as beta in Stardog 7.3.2 and is generally available (GA) as of version 7.4.0. Previous versions of Stardog already used the SPARQL service mechanism to support Full-Text Search and Entity Extraction, and SQS naturally extends it to stored queries. Suppose the following query is stored with the name “cities”:
$ stardog-admin stored add -n "cities" "SELECT ?country ?city { ?city :locatedIn ?country }"
Then it is possible to use it as a named subquery in another query:
prefix sqs: <tag:stardog:api:sqs:>
SELECT ?person ?city ?country {
SERVICE <query://cities> { [] sqs:vars ?country, ?city }
?person :from ?city
}
This query uses the “cities” query to look up information about the country given the city where a person lives. It is similar to using a Wikidata endpoint or an explicit subquery except that the subquery is referenced by name. The same query with an explicit subquery would look like this:
SELECT ?person ?city ?country {
{
SELECT ?country ?city {
?city :locatedIn ?country
}
}
?person :from ?city
}
Invoking stored queries by name has the major benefit that it avoids duplication of their query strings. Stored queries become reusable query building blocks maintained in one place rather than copy-pasted over the many queries which use them.
The body pattern of SERVICE <query://name> { ... } specifies which variables of the stored query are used in the outer scope of the calling query. The sqs:vars predicate is a shortcut that is useful when stored query variables retain their names. However, it’s possible to map stored query variable names to other identifiers to avoid naming conflicts:
prefix sqs: <tag:stardog:api:sqs:>
SELECT ?person ?city ?livesIn ?country {
SERVICE <query://countries> {
[] sqs:var:city ?livesIn ;
sqs:var:country ?country
}
?person :from ?livesIn ;
:born ?city
}
Furthermore, it’s possible to statically bind some stored query variables to constants so the query would behave like a parameterized view:
prefix sqs: <tag:stardog:api:sqs:>
SELECT ?city ?country {
SERVICE <query://countries> {
[] sqs:var:city ?city ;
sqs:var:country :The_United_States
}
}
Path Subqueries
Another interesting feature is the ability to call path queries from SELECT/CONSTRUCT/ASK queries. One cannot directly use a path query in a subquery because path queries do not return SPARQL binding sets, aka solutions (we discussed that issue in an earlier blog post on Extended Solutions). However, this service circumvents that restriction:
prefix sqs: <tag:stardog:api:sqs:>
SELECT ?start (count(*) as ?paths) {
SERVICE <query://paths> {
[] sqs:vars ?start
}
} GROUP BY ?start
The stored path query returns paths (according to some VIA pattern) and uses ?start as the start node variable. The main query aggregates the returned paths by the start node and returns the number of paths for each. In contrast to the earlier SELECT example, this would not be possible directly because path queries cannot be used as subqueries.
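The stored query behind <query://paths> is not shown on this page. As a minimal sketch, it could be a path query along the following lines. The :linksTo and :weight predicates and the ?end variable are hypothetical names chosen for illustration; only ?start (the start node variable) and ?attribute (a variable bound in the VIA pattern, used in a later example) correspond to names used on this page.

```sparql
# Hypothetical definition of the stored query "paths" (an assumption, not from the source).
# Each edge of a returned path is one match of the VIA pattern.
PATHS START ?start END ?end VIA {
    ?start :linksTo ?end .      # :linksTo is an illustrative predicate
    ?end :weight ?attribute     # ?attribute is bound once per edge
}
```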
One should be aware of the potentially explosive nature of path queries when using them through the stored query service. They can return a very high number of paths to be joined or aggregated and thus create substantial memory pressure on the server.
Stardog 7.3.2+ supports two new SPARQL functions which take a path as the argument: stardog:length and stardog:nodes. The former returns the length of the path and the latter generates a comma-separated string of all path nodes. Since SELECT query results do not support paths as first-class citizens (that is, any value in a binding set is an IRI, a literal, or a blank node), these functions provide a means to return path information by generating literals. Paths returned by the stored query service can be accessed via the reserved variable name ?path:
prefix sqs: <tag:stardog:api:sqs:>
prefix stardog: <tag:stardog:api:>
SELECT ?start (avg(stardog:length(?path)) as ?avg_length) {
SERVICE <query://paths> {
[] sqs:vars ?start, ?path
}
} GROUP BY ?start
As of 7.4.4, Stardog supports the additional stardog:all and stardog:any functions to check Boolean conditions over edges in paths returned by a stored path query. These are useful for filtering path query results on the server side:
prefix sqs: <tag:stardog:api:sqs:>
prefix stardog: <tag:stardog:api:>
SELECT (str(stardog:nodes(?path)) as ?nodes) {
SERVICE <query://paths> {
[] sqs:vars ?path
}
FILTER(stardog:all(?path, ?attribute = 10))
}
Here ?attribute is a variable occurring in the VIA pattern of the stored path query. stardog:all returns true if the ?attribute = 10 condition is true for all edges in the path. The second argument can be an arbitrary SPARQL expression. stardog:any is the complementary function, returning true if the condition is true for at least one edge. It is particularly useful for querying paths which must pass through one or more particular nodes in the graph.
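For comparison, the same filter written with stardog:any keeps every path that has at least one edge satisfying the condition. This sketch assumes the same stored query "paths" and the same ?attribute variable as the stardog:all example; the ?nodes variable name is illustrative.

```sparql
prefix sqs: <tag:stardog:api:sqs:>
prefix stardog: <tag:stardog:api:>

SELECT (str(stardog:nodes(?path)) AS ?nodes) {
    SERVICE <query://paths> {
        [] sqs:vars ?path
    }
    # keep a path if ?attribute = 10 holds on at least one of its edges
    FILTER(stardog:any(?path, ?attribute = 10))
}
```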
Correlated Subqueries
By default, evaluation of subqueries referenced through SQS is subject to the standard bottom-up SPARQL semantics. Specifically, they are evaluated once and their results are joined with other query patterns in the same scope (or Group Graph Pattern, or {} in SPARQL). In other words, just as for standard subqueries in SPARQL, the evaluation is uncorrelated as the subquery cannot use values of variables from the outer query. Consider the following example:
SELECT * {
?person :hasAge ?age
FILTER (?age > ?majorityAge)
}
This query is parameterized on the value of ?majorityAge and is meant to select all adults, i.e. people whose age exceeds their respective age of majority. Since ?majorityAge is not bound anywhere in this query, running it as-is will return empty results on any data. Thus using it as a standard, uncorrelated subquery will never achieve the intended results, as in the following example (the filter in the subquery will never evaluate to true):
# this won't return desired results!
SELECT ?country (count(?person) as ?c) {
{
SELECT * {
?person :hasAge ?age
FILTER (?age > ?majorityAge)
}
}
?country :hasMajorityAge ?majorityAge
} GROUP BY ?country
Overcoming this limitation with standard SPARQL subqueries requires workarounds, such as moving the filter outside of the subquery or looping over countries in a separate query. This is often inconvenient and may cause loss of performance (for example, enabling the Literal Index could make the filter in the subquery more efficient if ?majorityAge is bound). These issues are well-known and traditionally addressed by correlated subqueries, which are executed once for each tuple of values of variables bound in the outer query.
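As a sketch of the first workaround, the filter can be moved out of the subquery into the group where ?majorityAge is bound. This uses only standard SPARQL and the same :hasAge and :hasMajorityAge predicates as the example above:

```sparql
SELECT ?country (count(?person) AS ?c) {
    {
        # the subquery no longer references ?majorityAge
        SELECT ?person ?age {
            ?person :hasAge ?age
        }
    }
    ?country :hasMajorityAge ?majorityAge
    # evaluated after the join, when both ?age and ?majorityAge are bound
    FILTER (?age > ?majorityAge)
} GROUP BY ?country
```

The trade-off is that the subquery now returns every person regardless of age, so the comparison can no longer restrict the scan inside it.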
As of Stardog 7.6.1, SQS provides a way to indicate that the (stored) subquery is correlated on particular variables. This is done using the sqs:inputs predicate in the SERVICE pattern:
SELECT ?country (count(?person) as ?c) {
SERVICE <query://persons-by-age> {
[] sqs:inputs ?majorityAge ;
sqs:vars ?person
}
?country :hasMajorityAge ?majorityAge
} GROUP BY ?country
Now the query engine will execute the stored subquery for each value of ?majorityAge generated by the outer triple pattern, i.e. ?country :hasMajorityAge ?majorityAge. This is visible in the query plan, where the outer pattern is now an argument of a ServiceJoin operator which uses it to assign values to input variables (?majorityAge) before each execution of the service.
Group(by=[?country] aggregates=[(COUNT(*) AS ?c)]) [#1]
`─ ServiceJoin [#5.0K]
   +─ StoredQuery(persons-by-age: (?majorityAge) -> (?majorityAge, ?person)) {
   │  +─ Projection(?person, ?age, ?majorityAge)
   │  +─    `─ Filter(?age > ?majorityAge)
   │  +─       `─ Scan[SPO](?person, :age, ?age) [#1]
   │  }
   `─ Scan[PSO](?country, <http://api.stardog.com/majorityAge>, ?majorityAge) [#1]
Another classical example where correlated execution is essential is Top-K subqueries. Consider the following simple example: given payroll data, find the three highest-paid employees in each department. A naive attempt to do this in pure SPARQL falls short of the goal:
# this won't return desired results!
SELECT ?dept ?emp {
?dept a :Department
{ SELECT ?emp {
?emp :worksIn ?dept ;
:salary ?salary
} ORDER BY desc(?salary) LIMIT 3
}
} ORDER BY ?dept
The issue is again that the subquery is executed once and independently of the rest of the query. It will return the three highest-paid employees across all departments. The intention is of course to execute the subquery for each department after binding ?dept to a value matched by the outer query.
This can be achieved using SQS as follows:
SELECT ?dept ?emp {
?dept a :Department
SERVICE <query://employees> {
[] sqs:inputs ?dept ; sqs:vars ?emp
}
} ORDER BY ?dept
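The stored query "employees" is not shown above. Presumably it is the Top-K subquery from the naive attempt; under that assumption, its body would be:

```sparql
# Assumed body of the stored query "employees" (copied from the naive subquery above).
# ?dept is left unbound here; the caller supplies it through sqs:inputs.
SELECT ?emp {
    ?emp :worksIn ?dept ;
         :salary ?salary
} ORDER BY desc(?salary) LIMIT 3
```

With ?dept declared as an input, the ORDER BY and LIMIT apply to each department separately rather than across all departments.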
Just as for other variables, it’s possible to map input variables of the subquery to differently named variables of the main query, like ?dept to ?department in the following example. The engine will bind ?dept to the current value of ?department before each execution.
SELECT ?department ?emp {
?department a :Department
SERVICE <query://employees> {
[] sqs:inputs ?department ; sqs:var:dept ?department ; sqs:vars ?emp
}
} ORDER BY ?department
Correlated execution is also supported for path subqueries. There it is particularly important given the potentially high number of returned paths if the subquery runs without inputs. In the following example the paths subquery is executed for each value of the start variable:
prefix sqs: <tag:stardog:api:sqs:>
prefix stardog: <tag:stardog:api:>
SELECT ?start (str(stardog:nodes(?path)) as ?pstr) {
VALUES ?start { :X :Y :Z }
SERVICE <query://paths> {
[] sqs:inputs ?start ; sqs:vars ?path
}
}
It is similarly possible to indicate that variables appearing inside the VIA, START, or END patterns of a paths subquery are inputs.
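As a rough sketch, the caller below passes ?attribute, the variable from the VIA pattern used in the stardog:all example, as an input. The VALUES list is illustrative, and how the stored query "paths" uses ?attribute depends on its definition, which is not shown here.

```sparql
prefix sqs: <tag:stardog:api:sqs:>

SELECT ?attribute (count(*) AS ?paths) {
    # illustrative input values; the subquery runs once per value
    VALUES ?attribute { 10 20 }
    SERVICE <query://paths> {
        [] sqs:inputs ?attribute ; sqs:vars ?path
    }
} GROUP BY ?attribute
```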
Defining Dataset for Subqueries
SPARQL does not allow specifying the query dataset for subqueries, i.e. one cannot use the FROM or FROM NAMED keywords in a subquery. Subqueries inherit the dataset from the main query. SQS enables two ways of defining the dataset for a subquery (in order of precedence):
- directly in the SQS pattern using the sqs:default-graph and sqs:named-graph predicates.
- inside the stored query using the standard FROM and FROM NAMED keywords.
If none of the above is used, the stored subquery will inherit the dataset from the main query.
Both the sqs:default-graph and sqs:named-graph predicates can be used to specify multiple graphs to define the default and the named part of the query dataset (just like the FROM and FROM NAMED keywords can be used multiple times):
SELECT * {
SERVICE <query://name> {
[] sqs:default-graph :g1, :g2 ;
sqs:named-graph :g3, :g4
}
}
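To illustrate the precedence rule, suppose a hypothetical stored query "cities-in-g1" was saved with its own FROM clause. Under the order of precedence stated above, the sqs:default-graph predicate in the calling pattern should win, so the subquery would be evaluated against :g2 rather than :g1. The query name and graph IRIs are assumptions made for this sketch.

```sparql
# Hypothetical stored query "cities-in-g1":
#   SELECT ?country ?city FROM :g1 { ?city :locatedIn ?country }
prefix sqs: <tag:stardog:api:sqs:>

SELECT ?city ?country {
    SERVICE <query://cities-in-g1> {
        [] sqs:vars ?country, ?city ;
           sqs:default-graph :g2    # expected to take precedence over FROM :g1
    }
}
```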
It is possible to use Named Graph Aliases both in stored subqueries and in the range of the sqs:default-graph and sqs:named-graph predicates.