Literal Index
This page discusses an in-memory sorted index for numeric values occurring in the graph.
This feature is in beta.
Overview
Stardog 7.3.2 introduces a new kind of index: an in-memory sorted index for numerical values occurring in the graph. It’s aimed at improving performance of queries with range filters over values of numerical properties. For example, prior to 7.3.2, the filter in the following query:
SELECT * {
?product a :Product ;
:price ?price
FILTER(?price > 1000)
}
would be applied to ?price
values for all products, which is inefficient if there are many products with a price below 1000
. If the literal index is enabled, however, the engine would first scan it to obtain all price values in the range and then scan only the corresponding products. This is visible in the query plan:
Projection(?product, ?price)
`─ MergeJoin(?product)
+─ Scan[POS](?product, <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, :Product)
`─ Sort(?product)
`─ DirectHashJoin(?price)
+─ LiteralRangeScan("1000"^^xsd:integer, inf) -> price
`─ Scan[POS](?product, :price, ?price)
The LiteralRangeScan
operator scans the literal index so the matching ?price
values can be efficiently joined with the ?product :price ?price
pattern. The supposedly small result of that join is then further joined with ?product a :Product
pattern to obtain the final query result. See below for more information on query optimization when the literal index is enabled.
Configuration and Logging
The literal index is disabled by default. To enable it, specify a comma-separated list of IRIs of properties whose values should be indexed using the index.literals.properties
database option. The option accepts two meta IRIs: tag:stardog:api:property:all
and tag:stardog:api:property:none
meaning that all properties should be indexed or none, respectively (the latter is the default value).
In addition to that, there are two options to put a cap on the total size of the literal index:
index.literals.limit
index.literals.merge.limit
- the size of the differential literal index
The former is straightforward: it prevents the index from growing beyond the limit and consuming too much of the server’s memory. After reaching that limit, the index is dropped and no longer used in queries. The latter controls the max size of the index that’s maintained separately during transactions to avoid updates to the main literal index on every small change to the data. On reaching that limit the differential index is merged into the main literal index.
Detailed information regarding building and maintaining the literal index, as well as using it in queries, will be printed in stardog.log
by adding the following <Logger>
into your log4j2.xml
. See the Logging section for more details on modifying this file.
<Logger name="com.complexible.stardog.db.index.literals" level="DEBUG" additivity="false">
<AppenderRef ref="stardogAppender"/>
</Logger>
Query Optimization
The literal index is fully integrated into Stardog’s query engine. The optimizer will decide whether the index is applicable to a query (that is, if it’s up-to-date and the query uses filters over indexed properties) but also if its use would actually speed up the query. The latter is not necessarily true, consider the following example:
SELECT * {
?product a :Product ;
:from ?supplier ;
:price ?price .
?vendor :city :LittleRock
FILTER(?price > 10)
}
Assuming vendors from Little Rock don’t produce many products and the ?price > 10
predicate is not selective (i.e., there are many price points in that range in the database), the optimizer will probably decide to join vendors and products first and then simply apply the filter to the (supposedly small) result set. On the other hand, when the price range is tighter but the vendor predicate is not selective (let’s say it’s the state of California), the optimizer would first scan the literal index to obtain the price points in the range, and then look up corresponding products and vendors.
In either case, the query plan with cardinality estimations will show how the query will be evaluated. As always, there’s the possibility to force the optimizer to either use or not use the literal index by specifying the #pragma literal.index
hint. It admits the following values: aggressive
(always use the index if available), off
(never use it) and default
(the same as not having the hint at all).
Current limitations
The following needs to be kept in mind when using the literal index, as it is in beta:
- The index is in-memory only and is not persisted. As such, it is rebuilt on each server restart. The time it takes to rebuild the index is proportional to the number of triples with numerical predicates (configured via
index.literals.properties
). Each predicate is indexed concurrently with others. - The index is currently stored on Java heap. Its memory footprint depends greatly on the data, but when literal inlining is enabled (via
index.literals.canonical
, which istrue
by default) most numerical values are encoded as 64-bit integers. Then each literal index entry takes around 100 bytes on heap. Still, it increases heap consumption, so it might require a higher value for-Xmx
(see Capacity Planning for more information). - While the index supports transactional updates, we do not currently recommend using it when values of numerical properties are frequently updated (i.e. those triples often added or deleted). It may have a negative impact on transactional throughput and increase GC pressure on the system. The index can instead be enabled after the data is relatively stable and read query performance is prioritized.
- The index is managed as a single data structure for all indexed properties across all named graphs. If there are many indexed properties with overlapping ranges, it can have a negative impact on cardinality estimations (i.e., the optimizer can believe there are many numerical values in the range, while they could be for different properties than those used in the query).