Link Search Menu Expand Document
Start for Free

Literal Index

This page discusses an in-memory sorted index for numeric values occuring in the graph.

This feature is in beta.

Page Contents
  1. Overview
  2. Configuration and Logging
  3. Query Optimization
  4. Current limitations

Overview

Stardog 7.3.2 introduces a new kind of index: an in-memory sorted index for numerical values occurring in the graph. It’s aimed at improving performance of queries with range filters over values of numerical properties. For example, prior to 7.3.2 the filter in the following query:

SELECT * {
    ?product    a :Product ;
                :price ?price
    FILTER(?price > 1000)
}

would be applied to ?price values for all products which is inefficient if there are many products with a price below 1000. If the literal index is enabled, however, the engine would first scan it to obtain all price values in the range and then scan only the corresponding products. This is visible in the query plan:

Projection(?product, ?price)
`─ MergeJoin(?product)
   +─ Scan[POS](?product, <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, :Product)
   `─ Sort(?product)
      `─ DirectHashJoin(?price)
         +─ LiteralRangeScan("1000"^^xsd:integer, inf) -> price
         `─ Scan[POS](?product, :price, ?price)

The LiteralRangeScan operator scans the literal index so the matching ?price values can be efficiently joined with the ?product :price ?price pattern. The supposedly small result of that join is then further joined with ?product a :Product pattern to obtain the final query result. See below for more information on query optimization when the literal index is enabled.

Configuration and Logging

The literal index is disabled by default. To enable it, specify a comma-separated list of IRIs of properties whose values should be indexed using the index.literals.properties database option. The option accepts two meta IRIs: tag:stardog:api:property:all and tag:stardog:api:property:none meaning that all properties should be indexed or none, respectively (the latter is the default value).

In addition to that, there are two options to put a cap on the total size of the literal index:

  1. index.literals.limit
  2. index.literals.merge.limit - the size of the differential literal index

The former is straightforward: it prevents the index from growing beyond the limit and consuming too much of server’s memory. After reaching that limit, the index is dropped and no longer used in queries. The latter controls the max size of the index that’s maintained separately during transactions to avoid updates to the main literal index on every small change to the data. On reaching that limit the differential index is merged into the main literal index.

Detailed information regarding building, maintaining the literal index, as well as using it in queries will be printed in stardog.log by adding the following <Logger> into your log4j2.xml. See the Logging section for more details on modifying this file.

<Logger name="com.complexible.stardog.db.index.literals" level="DEBUG" additivity="false">
  <AppenderRef ref="stardogAppender"/>
</Logger>

Query Optimization

The literal index is fully integrated into the Stardog’s query engine. The optimizer will decide whether the index is applicable to a query (that is, if it’s up-to-date and the query uses filters over indexed properties) but also if its use would actually speed up the query. The latter is not necessarily true, consider the following example:

SELECT * {
    ?product    a :Product ;
                :from ?supplier ;
                :price ?price .
    ?vendor     :city :LittleRock
    FILTER(?price > 10)
}

Assuming vendors from Little Rock don’t produce many products and the ?price > 10 predicate is not selective (many price points in that range in the database), the optimizer will probably decide to join vendors and products first and then simply apply the filter to the (supposedly small) result set. On the other hand, when the price range is tighter but the vendor predicate is not selective (let’s say it’s the state of California), the optimizer would first scan the literal index to obtain the price points in the range, and then look up corresponding products and vendors.

In either case, the query plan with cardinality estimations will show how the query will be evaluated. As always, there’s the possibility to force the optimizer to either use or not use the literal index by specifying the #pragma literal.index hint. It admits the following values: aggressive (always use the index if available), off (never use it) and default (the same as not having the hint at all).

Current limitations

The following needs to be kept in mind when using the literal index while it is in beta:

  • The index is in-memory only and is not persisted. As such, it is rebuilt on each server restart. The time it takes to rebuild the index is proportional to the number of triples with numerical predicates (configured via index.literals.properties). Each predicate is indexed concurrently with others.
  • The index is currently stored on Java heap. Its memory footprint depends greatly on the data but when literal inlining is enabled (via index.literals.canonical which is true by default) most numerical values are encoded as 64-bit integers. Then each literal index entry takes around 100 bytes on heap. Still, it increases heap consumption so might require a higher value for -Xmx (see Capacity Planning for more information).
  • While the index supports transactional updates, we do not currently recommend to use it when values of numerical properties are frequently updated (i.e. those triples often added or deleted). It may have a negative impact on transactional throughput and increase GC pressure on the system. The index can instead be enabled after the data is relatively stable and read query performance is prioritized.
  • The index is managed as a single data structure for all indexed properties across all named graphs. If there’re many indexed properties with overlapping ranges, it can have a negative impact on cardinality estimations (i.e. the optimizer can believe that there’re many numerical values in the range while they could be for different properties than used in the query).