
Capacity Planning


Overview

The primary system resources used by Stardog are CPU, memory, and disk. Stardog also uses file handles and sockets, but we don't discuss those here. Stardog takes advantage of multiple CPUs, cores, and hardware threads during data loading and under throughput-heavy or multi-user loads. Stardog performance is influenced by the speed of CPUs and cores; however, some workloads are bound by main memory or disk I/O (or both) more than CPU. Use the fastest CPUs you can afford, with the largest secondary caches and the greatest number of cores and hardware threads, especially for multi-user workloads.

We recommend a minimum of 8 CPUs for production workloads. The cluster uses additional CPUs for replication and to handle events from ZooKeeper, so 16 CPUs is a reasonable starting point for production cluster workloads. However, as with all resource recommendations, it is extremely important that you run and test your own workloads to determine the optimal configuration. Workload patterns, Stardog features, and data characteristics can all affect resource usage in ways that general guidelines cannot anticipate. If Stardog is resource constrained, it can fail in unexpected ways (e.g., out-of-disk errors may cause data loss, and missed ZooKeeper events in the cluster can cause inconsistencies or unexpected node drops).

The following sections provide a starting point for sizing Stardog's memory and disk requirements. You may discover that these recommendations are not optimal for your specific workloads and require adjustment (allocating either more or fewer resources based on your usage).

Memory Usage

Stardog uses system memory aggressively, and the total system memory available to Stardog is often the most important factor in performance. Stardog uses both JVM memory (heap memory) and operating system memory outside the JVM (direct or native memory). Having more system memory available is always good; however, setting Stardog's memory limits too close to the total system memory is not prudent, as the operating system will not have enough memory for its own operations (see the guidelines below).

The following table shows the recommended system memory for a production Stardog deployment, based on how much data is stored locally, and how that memory should be divided between JVM heap memory and direct memory. The number of triples refers to the total number of triples stored in aggregate across all databases.

Note that the exact amount of memory needed can vary significantly depending on many factors other than graph size, including the characteristics of your graph, the amount of data processed by queries, virtual graph access patterns, the transactional load on the system, the amount of data bulk loaded into new databases, the number of concurrent users, and so on.

The values in this table can be used as a guideline, but the only way to make sure you have optimal settings is to try your workload on your data and analyze the memory metrics provided by Stardog.

Number of Triples     JVM Heap Memory   Direct Memory   Total System Memory
Less than 1 billion   8G                20G             32G
1 billion             16G               40G             64G
10 billion            30G               80G             128G
25 billion            60G               160G            256G
50 billion            80G               380G            512G

For production usage, we recommend running Stardog on a server with at least 32GB of system memory, regardless of data volume. Out of the box, Stardog sets the maximum JVM heap to 2GB and direct memory to 1GB, which is only suitable for testing small databases (fewer than 100 million triples) in non-production settings. Use the table above as guidance for configuring your production environment.

You can increase the memory for Stardog by setting STARDOG_SERVER_JAVA_ARGS with standard JVM options. For example, you can set it to "-Xms8g -Xmx8g -XX:MaxDirectMemorySize=20g" to increase the JVM heap to 8GB and direct memory to 20GB. We recommend setting the minimum heap size (-Xms option) and the maximum heap size (-Xmx option) to the same value.
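
A minimal sketch of that configuration, assuming STARDOG_SERVER_JAVA_ARGS is exported as an environment variable in the shell that launches the server:

  # Match the "less than 1 billion triples" row of the table above:
  # an 8GB heap (min and max equal) and 20GB of direct memory.
  export STARDOG_SERVER_JAVA_ARGS="-Xms8g -Xmx8g -XX:MaxDirectMemorySize=20g"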

Some general guidelines that can be used in addition to the above table:

  1. Heap memory must be set to a minimum of 2GB, and setting it higher than 100GB is typically not recommended due to increased GC pauses.
  2. The JVM uses the compressed OOPs optimization only when the heap limit is below 32GB; raising the heap slightly above 32GB disables this optimization and loses usable memory to larger pointers, so if you go past 32GB, you will only see noticeable benefits at around 50-60GB or higher.
  3. Direct memory should be set higher than heap memory for optimal performance.
  4. The sum of the heap and direct memory settings should be around 90% of the total system memory, so the operating system has enough memory for its own operations (see the sketch after this list).
  5. It is not recommended to run other memory-intensive applications on the same machine as a Stardog server, as they would compete for the same resources. If overall memory usage in the system rises to dangerously high levels, the operating system or the container runtime will kill the Stardog process.
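
To illustrate guidelines 3 and 4, the sketch below derives candidate heap and direct memory sizes from a machine's total memory. The 25%/65% split is a hypothetical starting point, not a Stardog-prescribed formula; it is chosen so that direct memory exceeds the heap and the two together stay near 90% of system memory. Where the table above applies, prefer its values.

  # Illustrative only: split total system memory (GB) into heap and direct
  # memory, keeping their sum at roughly 90% of the total.
  TOTAL_GB=64                            # total system memory in GB
  HEAP_GB=$(( TOTAL_GB * 25 / 100 ))     # heap: hypothetical 25% starting point
  DIRECT_GB=$(( TOTAL_GB * 65 / 100 ))   # direct: larger than heap (guideline 3)
  echo "-Xms${HEAP_GB}g -Xmx${HEAP_GB}g -XX:MaxDirectMemorySize=${DIRECT_GB}g"

For a 64GB machine, this prints -Xms16g -Xmx16g -XX:MaxDirectMemorySize=41g, close to the 16G/40G row in the table above.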

Disk Usage

Stardog stores data on disk in a compressed format. The disk space needed for a database depends on many factors besides the number of triples, including the number of unique resources and literals in the data, the average length of resource identifiers and literals, and how well the data compresses. As a general rule of thumb, every million triples require 70 MB to 100 MB of disk space, though actual usage may differ in practice. Also note that bulk loading data at database creation time temporarily requires more disk space, as temporary files are created during the load; this extra space can be 40% to 70% of the final database size.
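
For example, the following sketch applies the rule of thumb above to a hypothetical database of 500 million triples (illustrative arithmetic only; measure your own data for real numbers):

  # Estimate steady-state size and bulk-load peak for 500M triples.
  TRIPLES_M=500                        # database size in millions of triples
  LOW_MB=$(( TRIPLES_M * 70 ))         # lower bound: 70 MB per million triples
  HIGH_MB=$(( TRIPLES_M * 100 ))       # upper bound: 100 MB per million triples
  PEAK_MB=$(( HIGH_MB * 170 / 100 ))   # bulk load: up to 70% extra temp space
  echo "Steady state: ${LOW_MB}-${HIGH_MB} MB; bulk-load peak: up to ${PEAK_MB} MB"

This yields roughly 35-50 GB at steady state and up to about 85 GB during the initial bulk load.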

Disk space usage is additive across databases, and Stardog uses little disk space beyond what the databases themselves require. To calculate the total disk space needed for more than one database, sum the disk space needed by each database.