Stored Query Service

This page discusses the Stored Query Service, which enables users to run stored queries as subqueries.

Page Contents
  1. Overview
  2. Path Subqueries
  3. Correlated Subqueries
  4. Defining Dataset for Subqueries

Overview

Stardog can invoke stored queries, including Path Queries, from within another SPARQL query using the SERVICE keyword. The Stored Query Service (SQS) was released as beta in Stardog 7.3.2 and has been generally available (GA) since version 7.4.0. Earlier versions of Stardog already used the SPARQL service mechanism to support Full-Text Search and Entity Extraction; SQS extends that mechanism to stored queries. Suppose the following query is stored under the name “cities”:

$ stardog-admin stored add -n "cities" "SELECT ?country ?city { ?city :locatedIn ?country }"

Then it is possible to use it as a named subquery in another query:

prefix sqs: <tag:stardog:api:sqs:>

SELECT ?person ?city ?country {
    SERVICE <query://cities> { [] sqs:vars ?country, ?city }
    ?person :from ?city
}

This query uses the “cities” query to look up information about the country given the city where a person lives. It is similar to using a Wikidata endpoint or an explicit subquery except that the subquery is referenced by name. The same query with an explicit subquery would look like this:

SELECT ?person ?city ?country {
    {
        SELECT ?country ?city {
            ?city :locatedIn ?country
        }
    }
    ?person :from ?city
}

Invoking stored queries by name has the major benefit of avoiding duplication of their query strings. Stored queries become reusable building blocks maintained in one place rather than copy-pasted across the many queries that use them.

The body pattern of SERVICE <query://name> { ... } specifies which variables of the stored query are visible in the outer scope of the calling query. The sqs:vars predicate is a shortcut that is useful when the stored query variables keep their names. However, it is also possible to map stored query variable names to other identifiers to avoid naming conflicts:

prefix sqs: <tag:stardog:api:sqs:>

SELECT ?person ?city ?livesIn ?country {
    SERVICE <query://countries> {
        []  sqs:var:city ?livesIn ;
            sqs:var:country ?country
    }
    ?person :from ?livesIn ;
            :born ?city
}

Furthermore, it is possible to statically bind some stored query variables to constants, so that the stored query behaves like a parameterized view:

prefix sqs: <tag:stardog:api:sqs:>

SELECT ?city ?country {
    SERVICE <query://countries> {
        []  sqs:var:city ?city ;
            sqs:var:country :The_United_States
    }
}
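
For comparison, the parameterized call above should be roughly equivalent to inlining the stored query and restricting ?country to the constant. The sketch below assumes that the stored query “countries” has the same body as the “cities” query shown earlier, which the text does not state explicitly. Only ?city is selected here because the constant takes the place of the ?country variable.

```sparql
# Assumption: <query://countries> stores
#   SELECT ?country ?city { ?city :locatedIn ?country }
# Binding sqs:var:country to a constant then acts like the filter below.
SELECT ?city {
    {
        SELECT ?country ?city {
            ?city :locatedIn ?country
        }
    }
    FILTER(?country = :The_United_States)
}
```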

Path Subqueries

Another useful feature is the ability to call path queries from SELECT/CONSTRUCT/ASK queries. A path query cannot be used directly as a subquery because path queries do not return SPARQL binding sets, also known as solutions (that issue was discussed in an earlier blog post on Extended Solutions). The Stored Query Service circumvents this restriction:

prefix sqs: <tag:stardog:api:sqs:>

SELECT ?start (count(*) as ?paths) {
    SERVICE <query://paths> {
        [] sqs:vars ?start
    }
} GROUP BY ?start

The stored path query returns paths (according to some VIA pattern) and uses ?start as the start node variable. The main query aggregates the returned paths by the start node and returns the number of paths for each. In contrast to the earlier SELECT example, this would not be possible directly because path queries cannot be used as subqueries.

Be aware of the potentially explosive nature of path queries when using them through the Stored Query Service. They can return a very large number of paths to be joined or aggregated and thus create substantial memory pressure on the server.
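
The text does not show the stored “paths” query itself. A minimal sketch of what it might contain is given below, using Stardog's path query syntax with a hypothetical :knows predicate. The MAX LENGTH clause is one way to bound the number of paths the service can return; the value 3 is arbitrary.

```sparql
# Hypothetical body of the stored query "paths"; :knows is an assumed predicate.
# ?start is the start node variable referenced by the calling query.
PATHS START ?start END ?end VIA :knows
MAX LENGTH 3
```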

Stardog 7.3.2+ supports two SPARQL functions that take a path as their argument: stardog:length and stardog:nodes. The former returns the length of the path and the latter generates a comma-separated string of all path nodes. Since SELECT query results do not support paths as first-class citizens (that is, any value in a binding set is an IRI, a literal, or a blank node), these functions provide a means of returning path information as literals. Paths returned by the Stored Query Service can be accessed via the reserved variable name ?path:

prefix sqs: <tag:stardog:api:sqs:>
prefix stardog: <tag:stardog:api:>

SELECT ?start (avg(stardog:length(?path)) as ?avg_length) {
    SERVICE <query://paths> {
        [] sqs:vars ?start, ?path
    }
} GROUP BY ?start

As of version 7.4.4, Stardog also supports the stardog:all and stardog:any functions, which check Boolean conditions over the edges of paths returned by a stored path query. These are useful for filtering path query results on the server side:

prefix sqs: <tag:stardog:api:sqs:>
prefix stardog: <tag:stardog:api:>

SELECT (str(stardog:nodes(?path)) AS ?nodes) {
    SERVICE <query://paths> {
        [] sqs:vars ?path
    }
    FILTER(stardog:all(?path, ?attribute = 10))
}

Here ?attribute is a variable occurring in the VIA pattern of the stored path query. stardog:all returns true if the condition ?attribute = 10 holds for all edges in the path. The second argument can be an arbitrary SPARQL expression. stardog:any is the complementary function, which returns true if the condition holds for at least one edge. It is particularly useful for querying paths that must pass through one or more particular nodes in the graph.
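
As a sketch of the stardog:any case, the query below keeps only the paths that pass through a given node. It assumes the stored path query uses a VIA graph pattern such as { ?x :knows ?y }, so that ?y denotes the target node of each edge. The variable name ?y and the node :Charlie are illustrative and not taken from the text.

```sparql
prefix sqs: <tag:stardog:api:sqs:>
prefix stardog: <tag:stardog:api:>

# Assumption: the stored "paths" query has an edge variable ?y in its VIA pattern.
SELECT (str(stardog:nodes(?path)) AS ?nodes) {
    SERVICE <query://paths> {
        [] sqs:vars ?path
    }
    # true if at least one edge in the path ends at :Charlie
    FILTER(stardog:any(?path, ?y = :Charlie))
}
```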

Correlated Subqueries

By default, evaluation of subqueries referenced through SQS is subject to the standard bottom-up SPARQL semantics. Specifically, they are evaluated once and their results are joined with other query patterns in the same scope (or Group Graph Pattern, or {} in SPARQL). In other words, just as for standard subqueries in SPARQL, the evaluation is uncorrelated as the subquery cannot use values of variables from the outer query. Consider the following example:

SELECT * {
  ?person :hasAge ?age
  FILTER (?age > ?majorityAge) 
}

This query is parameterized on the value of ?majorityAge and is meant to select all adults, i.e. people whose age exceeds their respective age of majority. Since ?majorityAge is not bound anywhere in this query, running it as-is returns no results on any data. Using it as a standard, uncorrelated subquery therefore never achieves the intended result, as in the following example (the filter in the subquery never evaluates to true):

# this won't return desired results!
SELECT ?country (count(?person) as ?c) {
  {
    SELECT * {
      ?person :hasAge ?age
      FILTER (?age > ?majorityAge)
    }
  }
  ?country :hasMajorityAge ?majorityAge  
} GROUP BY ?country

Overcoming this limitation with standard SPARQL subqueries requires workarounds, such as moving the filter outside of the subquery or looping over countries in a separate query. This is often inconvenient and may cost performance (for example, enabling the Literal Index could make the filter in the subquery more efficient if ?majorityAge is bound). These issues are well known and are traditionally addressed by correlated subqueries, which are executed once for each tuple of values of the variables bound in the outer query.
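
The first workaround mentioned above, moving the filter outside of the subquery, can be sketched in standard SPARQL as follows. The subquery no longer references ?majorityAge, so the comparison is evaluated in the outer scope, where both variables are bound:

```sparql
# Workaround: the filter is evaluated after the join with the outer pattern,
# where both ?age and ?majorityAge are bound.
SELECT ?country (count(?person) as ?c) {
  {
    SELECT ?person ?age {
      ?person :hasAge ?age
    }
  }
  ?country :hasMajorityAge ?majorityAge
  FILTER (?age > ?majorityAge)
} GROUP BY ?country
```

This returns the intended counts, but the subquery must produce every person and age before any filtering takes place.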

As of Stardog 7.6.1, SQS provides a way to indicate that the (stored) subquery is correlated on particular variables. This is done using the sqs:inputs predicate in the SERVICE pattern:

SELECT ?country (count(?person) as ?c) {
  SERVICE <query://persons-by-age> {
    [] sqs:inputs ?majorityAge ;
       sqs:vars ?person
  }
  ?country :hasMajorityAge ?majorityAge  
} GROUP BY ?country

Now the query engine will execute the stored subquery for each value of ?majorityAge generated by the outer triple pattern, i.e. ?country :hasMajorityAge ?majorityAge. This is visible in the query plan where the outer pattern is now an argument of a ServiceJoin operator which uses it to assign values to input variables (?majorityAge) before each execution of the service.

Group(by=[?country] aggregates=[(COUNT(*) AS ?c)]) [#1]
`─ ServiceJoin [#5.0K]
   +─ StoredQuery(persons-by-age: (?majorityAge) -> (?majorityAge, ?person)) {
   │  +─ Projection(?person, ?age, ?majorityAge)
   │  +─ `─ Filter(?age > ?majorityAge)
   │  +─    `─ Scan[SPO](?person, :age, ?age) [#1]
   │  }
   `─ Scan[PSO](?country, <http://api.stardog.com/majorityAge>, ?majorityAge) [#1]

Another classic example where correlated execution is essential is Top-K subqueries. Consider the following simple example: given payroll data, find the three highest-paid employees in each department. The naive attempt to do this in pure SPARQL falls short of the goal:

# this won't return desired results!
SELECT ?dept ?emp {
  ?dept a :Department
  { SELECT ?emp {
      ?emp :worksIn ?dept ;
           :salary ?salary 
    } ORDER BY desc(?salary) LIMIT 3
  }
} ORDER BY ?dept

The issue is again that the subquery is executed once and independently of the rest of the query. It will return the three highest paid employees across all departments. The intention is of course to execute the subquery for each department after binding ?dept to a value matched by the outer query.

This can be achieved using SQS as follows:

SELECT ?dept ?emp {
  ?dept a :Department
  SERVICE <query://employees> {
    [] sqs:inputs ?dept ; sqs:vars ?emp
  }
} ORDER BY ?dept
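
The stored query “employees” is not shown in the text; presumably it is the Top-K subquery from the naive attempt. A sketch under that assumption follows. Whether the input variable ?dept must also be projected is not stated, so it is included here to be safe.

```sparql
# Assumed body of the stored query "employees".
# ?dept is bound by the caller through sqs:inputs before each execution,
# so ORDER BY and LIMIT apply per department.
SELECT ?dept ?emp {
  ?emp :worksIn ?dept ;
       :salary ?salary
} ORDER BY desc(?salary) LIMIT 3
```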

Just as for other variables, it is possible to map input variables of the subquery to differently named variables in the main query, such as ?dept to ?department in the following example. The engine binds ?dept to the current value of ?department before each execution.

SELECT ?department ?emp {
  ?department a :Department
  SERVICE <query://employees> {
    [] sqs:inputs ?department ; sqs:var:dept ?department ; sqs:vars ?emp
  }
} ORDER BY ?department

Correlated execution is also supported for path subqueries. There it is particularly important given the potentially high number of returned paths if the subquery runs without inputs. In the following example the paths subquery is executed for each value of the start variable:

prefix sqs: <tag:stardog:api:sqs:>
prefix stardog: <tag:stardog:api:>

SELECT ?start (str(stardog:nodes(?path)) as ?pstr) {
   VALUES ?start { :X :Y :Z }
   SERVICE <query://paths> {
     [] sqs:inputs ?start ; sqs:vars ?path
   }
}

It is similarly possible to indicate that variables appearing inside the VIA, START, or END patterns of a paths subquery are inputs.

Defining Dataset for Subqueries

SPARQL does not allow specifying the query dataset for subqueries, i.e. one cannot use the FROM or FROM NAMED keywords in a subquery. Subqueries inherit the dataset from the main query. SQS provides two ways of defining the dataset for a subquery (listed in order of precedence):

  • directly in the SQS pattern using sqs:default-graph and sqs:named-graph predicates.
  • inside the stored query using the standard FROM and FROM NAMED keywords.

If neither of the above is used, the stored subquery inherits the dataset from the main query.
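
For the second option, a stored query can carry its own dataset clauses. The sketch below is a hypothetical stored query; the graph names :g1 and :g3 are placeholders.

```sparql
# Hypothetical stored query that pins its own dataset.
# :g1 and :g3 are placeholder graph names.
SELECT ?city ?country
FROM :g1
FROM NAMED :g3
{
    ?city :locatedIn ?country
}
```

Given the order of precedence described above, sqs:default-graph or sqs:named-graph in the calling SERVICE pattern would take priority over these clauses.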

Both sqs:default-graph and sqs:named-graph predicates can be used to specify multiple graphs to define the default and the named part of the query dataset (just like FROM and FROM NAMED keywords can be used multiple times):

SELECT * {
  SERVICE <query://name> {
    [] sqs:default-graph :g1, :g2 ; 
       sqs:named-graph :g3, :g4
  }
}

It is possible to use Named Graph Aliases both in stored subqueries and in the range of sqs:default-graph and sqs:named-graph predicates.