Recommended Metrics and Alerts

This page describes the most important metrics to monitor and recommended alert thresholds for production Stardog deployments.

Page Contents
  1. Overview
  2. What to Monitor
  3. Memory Metrics
    1. Remediation
  4. Disk Metrics
    1. Remediation
  5. Cluster Replication Metrics
    1. Remediation
  6. Query Performance Metrics
    1. Remediation
  7. Transaction Metrics
    1. Remediation
  8. Server Load Metrics
    1. Remediation
  9. License Metrics
    1. Remediation
  10. Summary: Essential Dashboard

Overview

Stardog exposes hundreds of metrics via its monitoring system. While all of these metrics can be useful for deep-dive troubleshooting, a smaller subset is essential for day-to-day operational monitoring and alerting. This page identifies the most important metrics to track, organized by functional area, along with recommended alert thresholds and remediation steps.

For details on how to access these metrics (HTTP, CLI, Prometheus, JMX), see the Server Monitoring page.

What to Monitor

At a high level, production Stardog deployments should be monitored across these areas:

| Area | Why It Matters |
|---|---|
| Memory | Prevent out-of-memory conditions and excessive GC |
| Disk | Prevent disk-full conditions that cause data loss or corruption |
| Cluster replication | Ensure nodes stay in sync (HA deployments) |
| Query performance | Detect slow queries and throughput regressions before users are impacted |
| Transaction health | Ensure write operations complete in a timely manner |
| Server load | Identify CPU saturation, resource exhaustion, and request queuing |

Memory Metrics

These metrics track JVM and native memory usage.

| Metric | What to Watch | Alert Threshold | Severity |
|---|---|---|---|
| dbms.memory.system.usageRatio | Process memory usage ratio | > 0.9 for > 5 min | Urgent |
| dbms.memory.system.rss | Current RSS (includes all memory used by the process) | Covered by usageRatio above | |
| dbms.memory.system.rss.peak | Peak RSS since process start | Covered by usageRatio above | |
| dbms.memory.native.query.blocks.used | Native query block usage | Sustained value near dbms.memory.native.query.blocks.max | High |
| jvm.gc.<collector>.time | GC pause duration | Long or frequent pauses | High |

dbms.memory.system.rss.peak is often more useful than current RSS for capacity planning, as the current value may not be representative of what the workload requires under peak load.

dbms.memory.native.used will normally sit near 90% of dbms.memory.native.max because this memory is pre-allocated for query blocks and RocksDB caches. High native usage alone is not a concern. Focus on RSS and GC behavior instead.

GC metrics are only available when metrics.jvm.enabled=true in stardog.properties. The <collector> name depends on the JVM’s garbage collector configuration. For example, with G1GC (the default on Java 21) the metrics are jvm.gc.G1-Young-Generation.time and jvm.gc.G1-Old-Generation.time. Each collector also exposes a .count metric for the number of collections.
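A threshold such as "> 0.9 for > 5 min" only fires when the breach is sustained, not on a single spike. The following is a minimal sketch of that check, assuming the samples of dbms.memory.system.usageRatio have already been collected as (timestamp in seconds, value) pairs, oldest first. The function name and the sample format are hypothetical and not part of Stardog.

```python
from typing import Iterable, Tuple


def sustained_above(samples: Iterable[Tuple[float, float]],
                    threshold: float,
                    min_duration_s: float) -> bool:
    """Return True if the latest run of samples stayed above `threshold`
    for at least `min_duration_s` seconds.

    `samples` is assumed to be ordered oldest first (hypothetical format).
    """
    breach_start = None  # timestamp where the current breach began
    last_ts = None       # timestamp of the most recent sample
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts
        else:
            # Any sample at or below the threshold resets the breach window.
            breach_start = None
        last_ts = ts
    if breach_start is None:
        return False
    return (last_ts - breach_start) >= min_duration_s


# One sample per minute, all above 0.9 for five minutes.
steady = [(t, 0.95) for t in range(0, 301, 60)]
print(sustained_above(steady, 0.9, 300))
```

Most monitoring systems offer an equivalent built-in duration condition; the sketch only illustrates the logic.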

Remediation

  • High RSS approaching physical memory: The process is at risk of being killed by the OS OOM killer. This is the most urgent memory alert.
  • Query block exhaustion: Queries are competing for limited memory blocks. Increase memory allocation or reduce query concurrency.
  • Long or frequent GC pauses: Indicates heap memory pressure; Stardog may be under-provisioned for the workload. Increase -Xmx or reduce concurrent load. Long GC pauses can also destabilize clusters by causing heartbeat timeouts and node expulsions.

Disk Metrics

Stardog exposes disk space metrics directly. Disk I/O utilization and latency are not exposed by Stardog and should be monitored at the infrastructure level (e.g., via node exporter, CloudWatch, or system tools like iostat).

| Metric | What to Watch | Alert Threshold | Severity |
|---|---|---|---|
| dbms.home.free.space | Free space on the $STARDOG_HOME volume | < 20% of dbms.home.total.space | High |
| dbms.home.free.space | Free space on the $STARDOG_HOME volume | < 10% of dbms.home.total.space | Urgent |
| Disk I/O utilization (OS-level) | Storage subsystem saturation | > 90% for > 5 min | High |
| Disk I/O latency (OS-level) | Storage latency | > 20 ms average for > 5 min | Medium |

dbms.home.usable.space can be used instead of dbms.home.free.space for a more accurate measurement, as it accounts for OS-level permissions and reserved blocks that may not be available to the Stardog process.

These metrics only cover the $STARDOG_HOME volume. Stardog also writes temporary files to the temp directory (java.io.tmpdir, defaults to /tmp) and the spilling directory (spilling.dir, defaults to $STARDOG_HOME/.spilling). If these are on separate volumes, monitor their disk space independently. See Scratch Space for configuration details.
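The two free-space thresholds above are ratios of dbms.home.free.space to dbms.home.total.space. As a minimal sketch, assuming both values have already been read from the metrics in the same unit, the severity can be derived as follows. The function name is hypothetical.

```python
from typing import Optional


def disk_severity(free_space: float, total_space: float) -> Optional[str]:
    """Map free/total disk space to the severities in the table above.

    Returns "Urgent" below 10% free, "High" below 20% free, otherwise None.
    """
    if total_space <= 0:
        raise ValueError("total_space must be positive")
    free_ratio = free_space / total_space
    if free_ratio < 0.10:
        return "Urgent"
    if free_ratio < 0.20:
        return "High"
    return None


# 150 GB free on a 1000 GB volume is 15% free.
print(disk_severity(150e9, 1000e9))
```

The same function can be applied to dbms.home.usable.space, or to OS-level figures for the temp and spilling volumes if they are separate.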

Remediation

  • Low disk space: Stardog requires free disk space for compaction, transaction logs, and temporary files. Running out of disk space can cause data corruption. Expand the volume or remove unused databases and backups.
  • High disk I/O: Storage engine stalls often correlate with saturated disk I/O. Consider faster storage (SSD/NVMe) or reducing concurrent write load.

Cluster Replication Metrics

These metrics are relevant for High Availability deployments.

In a cluster, each node reports its own metrics independently. You should collect and monitor metrics from every node, not just one, because a sync or resource issue may only be visible on the affected node. A node that crashes stops reporting metrics entirely, so configure your monitoring system to alert when a node stops responding. Stardog provides two health endpoints: GET /admin/alive returns 200 if the node process is running (even if it is still syncing and has not yet joined the cluster), while GET /admin/healthcheck returns 200 only when the node has fully joined the cluster and is ready to serve requests. Use /admin/healthcheck as a readiness probe and /admin/alive as a liveness probe.
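The difference between the two endpoints can be turned into a three-state node status. The sketch below assumes the endpoints are reachable without authentication from the monitoring host and uses a hypothetical base URL; the function names and state labels are illustrative only.

```python
import urllib.error
import urllib.request
from typing import Optional


def http_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code for a GET on `url`, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered with a non-2xx status
    except OSError:
        return None      # connection refused, DNS failure, timeout, ...


def classify_node(alive_status: Optional[int],
                  health_status: Optional[int]) -> str:
    """Combine liveness and readiness results into a single state."""
    if alive_status != 200:
        return "down"
    if health_status != 200:
        return "alive-not-ready"  # for example, still syncing
    return "ready"


def probe_node(base_url: str) -> str:
    """Probe one node; `base_url` is an assumption, e.g. "http://node1:5820"."""
    alive = http_status(base_url + "/admin/alive")
    health = http_status(base_url + "/admin/healthcheck")
    return classify_node(alive, health)
```

Running probe_node against every node in the cluster, rather than against a load balancer address, keeps a single unhealthy node from being hidden behind healthy ones.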

Under normal conditions, sync events should not occur. A sync is triggered when a node gets expelled from the cluster, crashes, or restarts and has missed transactions. The node will first attempt a partial sync (faster, replays missed transactions from the log) and fall back to a full sync (much slower, copies the entire database) if the partial sync fails.

| Metric | What to Watch | Alert Threshold | Severity |
|---|---|---|---|
| cluster.sync.attempt.count | Partial sync attempts | Any increase outside of planned restarts | Medium |
| cluster.sync.failure.count | Partial sync failures | Any increase | High |
| cluster.fullsync.attempts | Full sync attempts | Any increase | High |
| cluster.fullsync.failures | Full sync failures | Any increase | Urgent |

cluster.fullsync.attempts is not updated in versions prior to 12.0.1.
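Because these thresholds are "any increase", the alert condition is a comparison between two consecutive snapshots of the counters rather than a fixed value. A minimal sketch, assuming each snapshot is a dictionary of metric name to value collected from one node (the snapshot format and function name are hypothetical):

```python
from typing import Dict, List, Tuple

# Severities copied from the table above.
SYNC_COUNTERS = {
    "cluster.sync.attempt.count": "Medium",
    "cluster.sync.failure.count": "High",
    "cluster.fullsync.attempts": "High",
    "cluster.fullsync.failures": "Urgent",
}


def sync_alerts(previous: Dict[str, float],
                current: Dict[str, float]) -> List[Tuple[str, str, float]]:
    """Return (metric, severity, increase) for every sync counter that grew.

    Missing counters are treated as 0. A counter that went down is assumed
    to have been reset by a restart and is not reported.
    """
    alerts = []
    for name, severity in SYNC_COUNTERS.items():
        before = previous.get(name, 0)
        after = current.get(name, 0)
        if after > before:
            alerts.append((name, severity, after - before))
    return alerts


prev = {"cluster.sync.attempt.count": 1}
curr = {"cluster.sync.attempt.count": 2, "cluster.fullsync.failures": 1}
for alert in sync_alerts(prev, curr):
    print(alert)
```

Increases in cluster.sync.attempt.count during planned restarts are expected, so that alert would typically be silenced for maintenance windows.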

Remediation

  • Frequent sync attempts: Syncs should only occur after node restarts or expulsions. Frequent syncs indicate cluster instability, which can be caused by resource exhaustion (memory, CPU, disk), network connectivity issues, or GC pauses causing heartbeat timeouts.
  • Partial sync failures: A partial sync fails when the transaction log on the source node no longer contains the transactions the recovering node missed. This typically means the transaction log size is too small for the cluster’s workload. Increase the transaction log size to retain more history.
  • Full sync attempts: A full sync means a partial sync was not possible. Full syncs are slower because they copy the entire database. Investigate why partial syncs are failing (usually insufficient transaction log size).
  • Full sync failures: A full sync failed to complete. Check disk space on both source and target nodes, network bandwidth between nodes, and server logs for error details.

Query Performance Metrics

These metrics track how queries are performing across your databases. Replace YourDb with the actual database name.

| Metric | What to Watch | Alert Threshold | Severity |
|---|---|---|---|
| databases.YourDb.queries.latency.p99 | 99th percentile query time | > 2x your baseline for > 5 min | High |
| databases.YourDb.queries.latency.p95 | 95th percentile query time | > 2x your baseline for > 5 min | Medium |
| databases.YourDb.queries.latency.mean | Average query time | Significant sustained increase | Medium |
| databases.YourDb.queries.running | Currently executing queries | Sustained value near thread pool size | High |
| databases.YourDb.queries.memory.spilled | Bytes spilled to disk | Non-zero and increasing rapidly | Medium |
| databases.YourDb.planCache.ratio | Query plan cache hit rate | < 0.5 for > 15 min | Low |
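The latency thresholds are relative to "your baseline", which has to be derived from your own history. One possible approach, shown here as a sketch, is to take the median of past p99 readings from a period of normal operation as the baseline. The choice of the median, the function names, and the sample values are assumptions for illustration, and the "for > 5 min" duration condition is left out.

```python
from statistics import median
from typing import Sequence


def baseline(history: Sequence[float]) -> float:
    """Baseline latency, taken here as the median of historical readings."""
    if not history:
        raise ValueError("history must not be empty")
    return median(history)


def breaches_baseline(current: float,
                      history: Sequence[float],
                      factor: float = 2.0) -> bool:
    """Return True if `current` exceeds `factor` times the baseline."""
    base = baseline(history)
    if base <= 0:
        raise ValueError("baseline must be positive")
    return current / base > factor


# Hypothetical p99 readings in milliseconds from a normal week.
normal_p99 = [100, 110, 120, 130, 90]
print(breaches_baseline(250, normal_p99))
print(breaches_baseline(200, normal_p99))
```

A median is less sensitive to occasional slow outliers than a mean, which is why it is used in this sketch; other baselines (for example, a percentile of the history) are equally valid.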

Remediation

  • High latency: Check the query log for slow queries. Use explain to analyze query plans. Consider adding or updating statistics.
  • Low throughput: Investigate thread pool saturation (see Server Load Metrics) or disk I/O bottlenecks (see Disk Metrics).
  • Memory spilling: The server is running out of in-memory query blocks and spilling to disk. Consider increasing memory allocation. See Memory Management.
  • Low plan cache ratio: Many unique queries are being issued. Parameterize queries where possible to improve cache reuse.

Transaction Metrics

These metrics track write transaction performance.

| Metric | What to Watch | Alert Threshold | Severity |
|---|---|---|---|
| databases.YourDb.txns.latency.p99 | 99th percentile transaction time | > 2x your baseline for > 5 min | High |
| databases.YourDb.txns.latency.mean | Average transaction latency | Significant sustained increase | Medium |
| databases.YourDb.txns.openTransactions | Currently open transactions | Sustained value > expected concurrency | Medium |
| databases.YourDb.txns.size.max | Largest recent transaction size | Unexpectedly large values | Low |

Remediation

  • High transaction latency: Check for large transactions, compaction backlogs, or lock contention. Smaller, more frequent transactions generally perform better.
  • Many open transactions: Long-running transactions hold resources. Investigate whether clients are failing to commit or roll back.

Server Load Metrics

These metrics track CPU usage, request handling capacity, and thread pool health.

| Metric | What to Watch | Alert Threshold | Severity |
|---|---|---|---|
| system.cpu.usage | CPU usage ratio | > 0.90 for > 5 min | High |
| user.threads.active | Active user-facing threads | Sustained value near user.threads.size | High |
| user.threads.queued | Queued user requests | Sustained increase over time | High |
| admin.threads.active | Active admin threads | Sustained value near admin.threads.size | Medium |
| admin.threads.queued | Queued admin requests | Sustained increase over time | Medium |
| com.stardog.http.server-PORT.avgRequesttime.p99 | 99th percentile HTTP request time | > 2x your baseline for > 5 min | Medium |
| com.stardog.http.server-PORT.currentRequests | In-flight HTTP requests | Sustained value near thread pool size | High |
| jvm.os.file-descriptors.open | Open file descriptors | > 80% of jvm.os.file-descriptors.max | High |

File descriptor metrics are only available when metrics.jvm.enabled=true in stardog.properties.
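"Sustained increase over time" for the queued-request metrics describes a trend rather than a level. One way to quantify it, sketched below, is the least-squares slope over a window of evenly spaced samples of user.threads.queued. The function names and the min_slope default are hypothetical and would need tuning for a real workload.

```python
from typing import Sequence


def slope(values: Sequence[float]) -> float:
    """Least-squares slope of `values` against their index (0, 1, 2, ...).

    Samples are assumed to be evenly spaced in time, so the result is the
    average change per sampling interval.
    """
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def is_growing(values: Sequence[float], min_slope: float = 0.5) -> bool:
    """Return True if the queue grows by more than `min_slope` per interval."""
    return slope(values) > min_slope


# Queue depth sampled once per minute.
print(is_growing([0, 1, 2, 3, 4]))
print(is_growing([3, 0, 4, 1, 2]))
```

A slope over a window ignores short bursts that drain on their own, which matches the intent of alerting only on queues that keep growing.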

Remediation

  • High CPU usage: Identify expensive queries via the query log. Consider optimizing queries, reducing concurrency, or scaling out.
  • Thread pool saturation: The server cannot keep up with incoming requests. Consider scaling out (add cluster nodes), reducing concurrent clients, or optimizing expensive queries.
  • Request queuing: Requests are waiting for threads. This is often a symptom of slow queries or transactions. Investigate the root cause rather than simply increasing pool size.
  • File descriptor exhaustion: Stardog uses file descriptors for database indices, client connections, and internal operations. Hitting the OS limit causes failures that can be difficult to diagnose. See Open Files Setting for configuration guidance.

License Metrics

| Metric | What to Watch | Alert Threshold | Severity |
|---|---|---|---|
| dbms.license.expiration | Days until license expiration | < 30 days | Medium |
| dbms.license.expiration | Days until license expiration | < 7 days | Urgent |

Remediation

  • License expiration: Contact Stardog support to renew your license. An expired license will prevent the server from starting.

Summary: Essential Dashboard

For a quick-start monitoring dashboard, prioritize these metrics:

| Category | Key Metric | Why |
|---|---|---|
| Process memory | dbms.memory.system.usageRatio | Detect OOM risk |
| Disk space | dbms.home.free.space | Prevent disk-full conditions |
| Query performance | databases.YourDb.queries.latency.p95 | Track user-facing performance |
| Query throughput | databases.YourDb.queries.latency.m1_rate | Track request volume |
| Write performance | databases.YourDb.txns.latency.p95 | Track write latency |
| Thread saturation | user.threads.queued | Detect request queuing |
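Since the per-database metric names contain the database name in place of YourDb, a dashboard covering several databases needs the names expanded per database. A minimal sketch of that expansion, using the metric names from the table above (the function name and the example database name are hypothetical):

```python
from typing import List, Tuple

# (category, metric name template) pairs taken from the table above.
DASHBOARD_TEMPLATES = [
    ("Process memory", "dbms.memory.system.usageRatio"),
    ("Disk space", "dbms.home.free.space"),
    ("Query performance", "databases.YourDb.queries.latency.p95"),
    ("Query throughput", "databases.YourDb.queries.latency.m1_rate"),
    ("Write performance", "databases.YourDb.txns.latency.p95"),
    ("Thread saturation", "user.threads.queued"),
]


def dashboard_metrics(db_name: str) -> List[Tuple[str, str]]:
    """Return the essential dashboard metrics with YourDb replaced by `db_name`."""
    return [(category, template.replace("YourDb", db_name))
            for category, template in DASHBOARD_TEMPLATES]


for category, metric in dashboard_metrics("sales"):
    print(f"{category}: {metric}")
```

Server-wide metrics such as dbms.memory.system.usageRatio contain no placeholder and are returned unchanged.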

Alert thresholds in this document are general recommendations. You should establish baselines specific to your workload and environment, then set thresholds relative to those baselines. What constitutes “normal” varies significantly between deployments.