Sampling service
This page describes the Sampling Service, which supports random sampling without replacement from the set of triples matched by a SPARQL triple pattern.
Overview
The Sampling Service allows for random sampling from the set of results matched by a particular SPARQL triple pattern. Sampling without replacement is useful for training and testing ML models, data exploration and visualization.
An example of using the Sampling Service follows below. Note that:
- The sample is always a subset of the results matched by the enclosed triple pattern, i.e. `?resource a ?resourceType` in the following example.
- We do not guarantee any particular sampling distribution, e.g. uniform or Gaussian.
- The returned sample size may be smaller than the value of `smp:size`.
prefix smp: <tag:stardog:api:sample:>
SELECT ?resource {
    service <tag:stardog:api:sample> {
        ?resource a ?resourceType .
        [] smp:size 10000 ;
    }
}
There are some preconditions to using the sampling service in a sensible way.
- The data is well compacted on disk, i.e. after running `stardog-admin db optimize`, or the data was imported during database creation.
- The sample size is much smaller than the total amount of data matched by the triple pattern.
- Sampling only needs to read a subset of the underlying data files to produce results.
Best results are obtained when the query result is unlikely to change whether the full triple pattern or just a sample is used. For example, consider getting the distinct outgoing predicates of `:Product` instances. We assume the number of distinct predicates is low, so a good sample of `:Product` instances is enough to start from.
prefix smp: <tag:stardog:api:sample:>
SELECT DISTINCT ?predicate {
    service <tag:stardog:api:sample> {
        ?resource a :Product .
        [] smp:size 10000 ;
    }
    ?resource ?predicate []
}
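The reasoning behind this query can be illustrated outside of Stardog. The following Python sketch is a toy simulation, not Stardog code: the product schema in `predicates_of`, the population size, and the seed are all made up for illustration. It shows that when only a handful of distinct predicates exist, a small sample drawn without replacement is very likely to surface all of them.

```python
import random


def predicates_of(product_id):
    """Hypothetical toy schema: every product has a type and a name,
    every 3rd product has a price, every 10th product has a review."""
    preds = {"rdf:type", ":name"}
    if product_id % 3 == 0:
        preds.add(":price")
    if product_id % 10 == 0:
        preds.add(":review")
    return preds


def distinct_predicates(product_ids):
    """Collect the distinct predicates used by the given products."""
    seen = set()
    for product_id in product_ids:
        seen.update(predicates_of(product_id))
    return seen


all_products = range(100_000)

# Sampling without replacement: random.sample never returns duplicates.
rng = random.Random(42)
sample = rng.sample(all_products, 1_000)

full = distinct_predicates(all_products)  # "full triple pattern"
approx = distinct_predicates(sample)      # "sample only"

print(sorted(full))
print(approx == full)
```

The rarest predicate here appears on one product in ten, so the chance that 1,000 sampled products all miss it is negligible. The approach becomes less reliable as the predicates of interest get rarer relative to the sample size.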
Another example, on a LUBM-generated dataset:
prefix lubm: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
prefix smp: <tag:stardog:api:sample:>
select distinct ?p1 ?p2 where {
    service <tag:stardog:api:sample> {
        ?start a lubm:GraduateStudent .
        [] smp:size 1000
    }
    ?start ?p1 ?middle . ?middle ?p2 ?end .
    ?end a lubm:University
}
limit 100
Internals
Several parameters influence how sampling is performed. Sampling occurs in one of two modes: `random` or `fast`. By default, the `random` mode is used.
The Sampling Service is designed so that it can be an order of magnitude faster than a full index scan with reservoir sampling. It achieves this by scanning only a fraction of the actual dataset: the service may skip entire data files created by the Stardog storage engine. As a result, the statistical quality of the sample may depend on how the indexed triples are distributed across the data files, and it can be improved by tuning the parameters away from their defaults.
The Sampling Service first selects a random subset of the underlying data files. The number of included data files can be controlled via the `ratio` parameter, which specifies the probability with which each underlying file is included. The valid range for this parameter is `(0.0, 1.0]`.
In the default `random` mode, every triple in the selected data files is included in the sample with probability `p(include) = sampleSize / totalTriples`. Here `totalTriples` is an estimate of the number of triples matched by the enclosed triple pattern. It is derived from internal statistics, but a hint may also be provided via the `total` parameter. Alternatively, in `fast` mode, only the first k triples of each data file are included, such that the sum over all k equals the specified sample size. The quality of such a sample may be poor, but it can still be sufficient for some applications.
Example:
prefix smp: <tag:stardog:api:sample:>
SELECT ?resource {
    SERVICE <tag:stardog:api:sample> {
        ?resource a ?resourceType .
        [] smp:size 100000 ;
           smp:ratio 0.75 ;
           smp:mode "fast" .
    }
}
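The two-stage procedure described in this section can be sketched in Python. This is a simplified model based only on the description above, not Stardog's actual implementation: data files are modeled as plain lists of triples, the function names are hypothetical, and the even split of k across files in `sample_fast` is an assumption, since the text only states that the k values sum to the sample size.

```python
import random


def select_files(files, ratio, rng):
    """Stage 1: keep each data file independently with probability `ratio`."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0.0, 1.0]")
    # rng.random() is in [0.0, 1.0), so ratio == 1.0 keeps every file.
    return [f for f in files if rng.random() < ratio]


def sample_random(files, size, total, rng):
    """'random' mode: scan every triple of the selected files and keep
    each one with probability size / total.

    `total` stands for the estimated number of matching triples."""
    if total <= 0:
        raise ValueError("total must be a positive estimate")
    p = min(1.0, size / total)
    return [t for f in files for t in f if rng.random() < p]


def sample_fast(files, size):
    """'fast' mode: take only the first k triples of each selected file.

    Assumption: k is split as evenly as possible across files. A file
    shorter than its k contributes fewer triples."""
    if not files:
        return []
    base, extra = divmod(size, len(files))
    picked = []
    for i, f in enumerate(files):
        k = base + (1 if i < extra else 0)
        picked.extend(f[:k])
    return picked


# Toy dataset: 8 "data files" holding 100 triples each.
rng = random.Random(0)
files = [[(file_no, i) for i in range(100)] for file_no in range(8)]

selected = select_files(files, ratio=0.75, rng=rng)
random_sample = sample_random(selected, size=50, total=800, rng=rng)
fast_sample = sample_fast(selected, size=50)
print(len(selected), len(random_sample), len(fast_sample))
```

In this model, `random` mode touches every triple in the selected files, while `fast` mode reads only a prefix of each file. That is consistent with the trade-off described above: `fast` does less work, but its sample reflects only whatever happens to be stored at the start of each file.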
Limitations
Only single triple patterns are supported inside `service <tag:stardog:api:sample>`; arbitrary Basic Graph Patterns (BGPs) are not supported at this time.
Parameter Table
Option name | Default | Description |
---|---|---|
size | required | Desired sample size to return |
ratio | 0.5 | Ratio of underlying storage files to include. Value 1.0 includes all files |
mode | random | random scans all entries, fast scans just the beginning of underlying files |
total | 0 | Hint of total number of triples matched. Zero or negative number will make the service use internal statistics |
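As an illustration of how the options in the table fit together, the following Python sketch builds the `service` clause from validated parameters. The helper `sample_service_clause` is hypothetical and not part of any Stardog client library. Two points are assumptions rather than documented behavior: that `size` must be positive, and that the `total` option is written as `smp:total`, by analogy with `smp:size`, `smp:ratio`, and `smp:mode` used in the examples on this page.

```python
def sample_service_clause(pattern, size, ratio=0.5, mode="random", total=0):
    """Render a sampling SERVICE clause for a single triple pattern.

    Defaults mirror the parameter table. `smp:total` is an assumed
    property name, inferred from the naming of the other options."""
    if size <= 0:
        # Assumption: a sample size only makes sense when positive.
        raise ValueError("size is required and must be positive")
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0.0, 1.0]")
    if mode not in ("random", "fast"):
        raise ValueError('mode must be "random" or "fast"')

    lines = [
        "service <tag:stardog:api:sample> {",
        f"    {pattern} .",
        f"    [] smp:size {size} ;",
        f"       smp:ratio {ratio} ;",
        f'       smp:mode "{mode}"',
    ]
    if total > 0:
        # Zero or negative means "use internal statistics", so the
        # hint is only emitted when a positive value is given.
        lines[-1] += " ;"
        lines.append(f"       smp:total {total}")
    lines.append("}")
    return "\n".join(lines)


print(sample_service_clause("?resource a :Product", size=10000))
```

Validating the values before sending the query surfaces mistakes such as a `ratio` of 0 on the client side, and the resulting string can be embedded in the body of a `SELECT` query like the ones shown earlier.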