Recommended Metrics and Alerts

This page describes the most important metrics to monitor and recommended alert thresholds for production Stardog deployments.

Page Contents

Overview
What to Monitor
Memory Metrics
1. Remediation
Disk Metrics
1. Remediation
Cluster Replication Metrics
1. Remediation
Query Performance Metrics
1. Remediation
Transaction Metrics
1. Remediation
Server Load Metrics
1. Remediation
License Metrics
1. Remediation
Summary: Essential Dashboard

Overview

Stardog exposes hundreds of metrics via its monitoring system. While all of these metrics can be useful for deep-dive troubleshooting, a smaller subset is essential for day-to-day operational monitoring and alerting. This page identifies the most important metrics to track, organized by functional area, along with recommended alert thresholds and remediation steps.

For details on how to access these metrics (HTTP, CLI, Prometheus, JMX), see the Server Monitoring page.

What to Monitor

At a high level, production Stardog deployments should be monitored across these areas:

Area	Why It Matters
Memory	Prevent out-of-memory conditions and excessive GC
Disk	Prevent disk-full conditions that cause data loss or corruption
Cluster replication	Ensure nodes stay in sync (HA deployments)
Query performance	Detect slow queries and throughput regressions before users are impacted
Transaction health	Ensure write operations complete in a timely manner
Server load	Identify CPU saturation, resource exhaustion, and request queuing
License	Prevent an expired license from blocking server startup

Memory Metrics

These metrics track JVM and native memory usage.

Metric	What to Watch	Alert Threshold	Severity
`dbms.memory.system.usageRatio`	Process memory usage ratio	> 0.9 for > 5 min	Urgent
`dbms.memory.system.rss`	Current RSS (includes all memory used by the process)	Covered by `usageRatio` above	—
`dbms.memory.system.rss.peak`	Peak RSS since process start	Covered by `usageRatio` above	—
`dbms.memory.native.query.blocks.used`	Native query block usage	Sustained value near `dbms.memory.native.query.blocks.max`	High
`jvm.gc.<collector>.time`	GC pause duration	Long or frequent pauses	High

`dbms.memory.system.rss.peak` is often more useful than current RSS for capacity planning, as the current value may not be representative of what the workload requires under peak load.

`dbms.memory.native.used` will normally sit near 90% of `dbms.memory.native.max` because this memory is pre-allocated for query blocks and RocksDB caches. High native usage alone is not a concern. Focus on RSS and GC behavior instead.

GC metrics are only available when `metrics.jvm.enabled=true` in `stardog.properties`. The `<collector>` name depends on the JVM’s garbage collector configuration. For example, with G1GC (the default on Java 21) the metrics are `jvm.gc.G1-Young-Generation.time` and `jvm.gc.G1-Old-Generation.time`. Each collector also exposes a `.count` metric for the number of collections.
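
Thresholds such as "> 0.9 for > 5 min" should fire only when the value stays above the limit for the whole window, not on a single spike. Most monitoring systems can express this in their own rule language; the Python sketch below only illustrates the logic. `SustainedRule` and `is_sustained_breach` are hypothetical names, not part of Stardog, and the samples are assumed to be (timestamp, value) pairs already collected elsewhere.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class SustainedRule:
    threshold: float       # alert when the value is above this limit ...
    min_duration_s: float  # ... continuously for longer than this many seconds


def is_sustained_breach(
    samples: Iterable[Tuple[float, float]], rule: SustainedRule
) -> bool:
    """Return True if the latest run of samples has stayed above the threshold
    for longer than rule.min_duration_s.

    `samples` holds (unix_timestamp_seconds, value) pairs in ascending time order.
    """
    breach_start: Optional[float] = None
    last_ts: Optional[float] = None
    for ts, value in samples:
        if value > rule.threshold:
            if breach_start is None:
                breach_start = ts  # first sample of the current breach
        else:
            breach_start = None    # any dip below the limit resets the window
        last_ts = ts
    if breach_start is None or last_ts is None:
        return False
    return (last_ts - breach_start) > rule.min_duration_s


# Rule matching the dbms.memory.system.usageRatio row: > 0.9 for > 5 min.
MEMORY_RULE = SustainedRule(threshold=0.9, min_duration_s=300)
```

A brief dip below the threshold resets the window, so a series that recovers even once within five minutes does not alert.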

Remediation

High RSS approaching physical memory: The process is at risk of being killed by the OS OOM killer. This is the most urgent memory alert.
Query block exhaustion: Queries are competing for limited memory blocks. Increase memory allocation or reduce query concurrency.
Long or frequent GC pauses: Indicates heap memory pressure; Stardog may be under-provisioned for the workload. Increase -Xmx or reduce concurrent load. Long GC pauses can also destabilize clusters by causing heartbeat timeouts and node expulsions.

Disk Metrics

Stardog exposes disk space metrics directly. Disk I/O utilization and latency are not exposed by Stardog and should be monitored at the infrastructure level (e.g., via node exporter, CloudWatch, or system tools like iostat).

Metric	What to Watch	Alert Threshold	Severity
`dbms.home.free.space`	Free space on the `$STARDOG_HOME` volume	< 20% of `dbms.home.total.space`	High
`dbms.home.free.space`	Free space on the `$STARDOG_HOME` volume	< 10% of `dbms.home.total.space`	Urgent
Disk I/O utilization (OS-level)	Storage subsystem saturation	> 90% for > 5 min	High
Disk I/O latency (OS-level)	Storage latency	> 20ms average for > 5 min	Medium

`dbms.home.usable.space` can be used instead of `dbms.home.free.space` for a more accurate measurement, as it accounts for OS-level permissions and reserved blocks that may not be available to the Stardog process.

These metrics only cover the `$STARDOG_HOME` volume. Stardog also writes temporary files to the temp directory (`java.io.tmpdir`, defaults to `/tmp`) and the spilling directory (`spilling.dir`, defaults to `$STARDOG_HOME/.spilling`). If these are on separate volumes, monitor their disk space independently. See Scratch Space for configuration details.
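
The two free-space tiers in the table above can be expressed as a small classification function. This is a sketch, assuming the free-space (or usable-space) metric and `dbms.home.total.space` are reported in the same unit; `disk_space_severity` is a hypothetical helper name, not a Stardog API.

```python
from typing import Optional


def disk_space_severity(free: float, total: float) -> Optional[str]:
    """Map free space on the $STARDOG_HOME volume to an alert severity.

    Returns "Urgent" below 10% free, "High" below 20% free, otherwise None.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    free_ratio = free / total
    if free_ratio < 0.10:
        return "Urgent"
    if free_ratio < 0.20:
        return "High"
    return None
```

The same function can be applied to the temp and spilling volumes when they are mounted separately, using free and total space figures obtained from the operating system.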

Remediation

Low disk space: Stardog requires free disk space for compaction, transaction logs, and temporary files. Running out of disk space can cause data corruption. Expand the volume or remove unused databases and backups.
High disk I/O: Storage engine stalls often correlate with saturated disk I/O. Consider faster storage (SSD/NVMe) or reducing concurrent write load.

Cluster Replication Metrics

These metrics are relevant for High Availability deployments.

In a cluster, each node reports its own metrics independently. You should collect and monitor metrics from every node, not just one. A sync or resource issue may only be visible on the affected node. If a node crashes, it will stop reporting metrics entirely. Configure your monitoring system to alert when a node stops responding.

Stardog provides two health endpoints: `GET /admin/alive` returns 200 if the node process is running (even if it is still syncing and has not yet joined the cluster), while `GET /admin/healthcheck` returns 200 only when the node has fully joined the cluster and is ready to serve requests. Use `/admin/healthcheck` as a readiness probe and `/admin/alive` as a liveness probe.
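
As an illustration of how the two health endpoints combine, the sketch below polls both and maps the results to a node state. It assumes the endpoints are reachable without authentication from the monitoring host and that `base_url` points at a single node (for example `http://node1:5820`, a placeholder address); the function names are hypothetical.

```python
import urllib.error
import urllib.request
from typing import Optional


def http_status(url: str, timeout_s: float = 5.0) -> Optional[int]:
    """Return the HTTP status code of a GET request, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None      # connection refused, DNS failure, timeout, ...


def classify_node(alive_status: Optional[int], health_status: Optional[int]) -> str:
    """Combine liveness and readiness results into a single node state."""
    if alive_status != 200:
        return "down"       # process not responding: liveness failure
    if health_status != 200:
        return "not ready"  # running, but syncing or not yet joined
    return "ready"


def probe_node(base_url: str) -> str:
    """Probe one node; call this for every node in the cluster."""
    alive = http_status(base_url + "/admin/alive")
    health = http_status(base_url + "/admin/healthcheck")
    return classify_node(alive, health)
```

A node that stays in the "not ready" state for a long time after a restart is usually worth correlating with the sync metrics on that node.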

Under normal conditions, sync events should not occur. A sync is triggered when a node gets expelled from the cluster, crashes, or restarts and has missed transactions. The node will first attempt a partial sync (faster, replays missed transactions from the log) and fall back to a full sync (much slower, copies the entire database) if the partial sync fails.

Metric	What to Watch	Alert Threshold	Severity
`cluster.sync.attempt.count`	Partial sync attempts	Any increase outside of planned restarts	Medium
`cluster.sync.failure.count`	Partial sync failures	Any increase	High
`cluster.fullsync.attempts`	Full sync attempts	Any increase	High
`cluster.fullsync.failures`	Full sync failures	Any increase	Urgent

`cluster.fullsync.attempts` is not updated in versions prior to 12.0.1.
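
Because the sync thresholds are expressed as "any increase", alerting on them means comparing two consecutive scrapes of the same node rather than checking an absolute value. The sketch below shows one way to do that. It assumes the counters restart from zero when the node process restarts (an assumption, not something stated here), and `sync_counter_alerts` is a hypothetical helper.

```python
from typing import List, Mapping, Tuple

# Severity per counter, taken from the table above.
SYNC_COUNTER_SEVERITY = {
    "cluster.sync.attempt.count": "Medium",
    "cluster.sync.failure.count": "High",
    "cluster.fullsync.attempts": "High",
    "cluster.fullsync.failures": "Urgent",
}


def sync_counter_alerts(
    previous: Mapping[str, int],
    current: Mapping[str, int],
    planned_restart: bool = False,
) -> List[Tuple[str, str, int]]:
    """Return (metric, severity, increase) for every sync counter that grew
    between two scrapes of the same node."""
    alerts = []
    for name, severity in SYNC_COUNTER_SEVERITY.items():
        if name not in previous or name not in current:
            continue  # metric missing from one scrape: nothing to compare
        delta = current[name] - previous[name]
        if delta < 0:
            # Counter went backwards; assume a process restart reset it.
            delta = current[name]
        if delta <= 0:
            continue
        if planned_restart and name == "cluster.sync.attempt.count":
            continue  # partial sync attempts are expected after planned restarts
        alerts.append((name, severity, delta))
    return alerts
```

Only the partial sync attempt counter is suppressed during a planned restart; failures and full syncs still alert, matching the thresholds in the table.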

Remediation

Frequent sync attempts: Syncs should only occur after node restarts or expulsions. Frequent syncs indicate cluster instability, which can be caused by resource exhaustion (memory, CPU, disk), network connectivity issues, or GC pauses causing heartbeat timeouts.
Partial sync failures: A partial sync fails when the transaction log on the source node no longer contains the transactions the recovering node missed. This typically means the transaction log size is too small for the cluster’s workload. Increase the transaction log size to retain more history.
Full sync attempts: A full sync means a partial sync was not possible. Full syncs are slower because they copy the entire database. Investigate why partial syncs are failing (usually insufficient transaction log size).
Full sync failures: A full sync failed to complete. Check disk space on both source and target nodes, network bandwidth between nodes, and server logs for error details.

Query Performance Metrics

These metrics track how queries are performing across your databases. Replace YourDb with the actual database name.

Metric	What to Watch	Alert Threshold	Severity
`databases.YourDb.queries.latency.p99`	99th percentile query time	> 2x your baseline for > 5 min	High
`databases.YourDb.queries.latency.p95`	95th percentile query time	> 2x your baseline for > 5 min	Medium
`databases.YourDb.queries.latency.mean`	Average query time	Significant sustained increase	Medium
`databases.YourDb.queries.running`	Currently executing queries	Sustained value near thread pool size	High
`databases.YourDb.queries.memory.spilled`	Bytes spilled to disk	Non-zero and increasing rapidly	Medium
`databases.YourDb.planCache.ratio`	Query plan cache hit rate	< 0.5 for > 15 min	Low

Remediation

High latency: Check the query log for slow queries. Use explain to analyze query plans. Consider adding or updating statistics.
Low throughput: Investigate thread pool saturation (see Server Load Metrics) or disk I/O bottlenecks (see Disk Metrics).
Memory spilling: The server is running out of in-memory query blocks and spilling to disk. Consider increasing memory allocation. See Memory Management.
Low plan cache ratio: Many unique queries are being issued. Parameterize queries where possible to improve cache reuse.
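
The latency thresholds above are relative to a baseline that each deployment has to establish for itself. One simple approach, shown here as a sketch rather than a recommendation, is to take the median of historical samples of a percentile metric as the baseline and to alert when every sample in a recent window exceeds twice that value. The function names and the choice of the median are assumptions made for illustration.

```python
from statistics import median
from typing import Sequence


def latency_baseline(history: Sequence[float]) -> float:
    """Baseline for a latency metric: median of historical samples.

    The median is used so that occasional spikes do not inflate the baseline.
    """
    if not history:
        raise ValueError("history must contain at least one sample")
    return median(history)


def exceeds_baseline(
    recent: Sequence[float], baseline: float, factor: float = 2.0
) -> bool:
    """True when every sample in the recent window is above factor * baseline."""
    if not recent:
        return False
    return all(value > factor * baseline for value in recent)
```

With samples scraped once a minute, passing the last five values of `databases.YourDb.queries.latency.p99` as `recent` approximates the "> 2x your baseline for > 5 min" condition.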

Transaction Metrics

These metrics track write transaction performance.

Metric	What to Watch	Alert Threshold	Severity
`databases.YourDb.txns.latency.p99`	99th percentile transaction time	> 2x your baseline for > 5 min	High
`databases.YourDb.txns.latency.mean`	Average transaction latency	Significant sustained increase	Medium
`databases.YourDb.txns.openTransactions`	Currently open transactions	Sustained value > expected concurrency	Medium
`databases.YourDb.txns.size.max`	Largest recent transaction size	Unexpectedly large values	Low

Remediation

High transaction latency: Check for large transactions, compaction backlogs, or lock contention. Smaller, more frequent transactions generally perform better.
Many open transactions: Long-running transactions hold resources. Investigate whether clients are failing to commit or roll back.

Server Load Metrics

These metrics track CPU usage, request handling capacity, and thread pool health.

Metric	What to Watch	Alert Threshold	Severity
`system.cpu.usage`	CPU usage ratio	> 0.90 for > 5 min	High
`user.threads.active`	Active user-facing threads	Sustained value near `user.threads.size`	High
`user.threads.queued`	Queued user requests	Sustained increase over time	High
`admin.threads.active`	Active admin threads	Sustained value near `admin.threads.size`	Medium
`admin.threads.queued`	Queued admin requests	Sustained increase over time	Medium
`com.stardog.http.server-PORT.avgRequesttime.p99`	99th percentile HTTP request time	> 2x your baseline for > 5 min	Medium
`com.stardog.http.server-PORT.currentRequests`	In-flight HTTP requests	Sustained value near thread pool size	High
`jvm.os.file-descriptors.open`	Open file descriptors	> 80% of `jvm.os.file-descriptors.max`	High

File descriptor metrics are only available when `metrics.jvm.enabled=true` in `stardog.properties`.
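
Several thresholds in the table above compare a usage metric against its limit, such as open file descriptors against `jvm.os.file-descriptors.max` or active threads against the pool size. The sketch below computes those ratios. The 0.90 cut-off used for "near" the pool size is an assumption chosen for illustration, since the table does not give a number, and the function names are hypothetical.

```python
def utilization(used: float, limit: float) -> float:
    """Fraction of a limit currently in use."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return used / limit


def file_descriptor_alert(open_fds: int, max_fds: int, threshold: float = 0.80) -> bool:
    """True when open descriptors exceed 80% of the maximum (High severity)."""
    return utilization(open_fds, max_fds) > threshold


def pool_near_capacity(active: int, size: int, near: float = 0.90) -> bool:
    """True when active threads are at or above `near` of the pool size.

    `near` is an illustrative value; tune it to your own workload.
    """
    return utilization(active, size) >= near
```

As with the other thresholds, the thread pool check is meaningful only when the condition is sustained; a single scrape at full capacity is normal under bursty load.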

Remediation

High CPU usage: Identify expensive queries via the query log. Consider optimizing queries, reducing concurrency, or scaling out.
Thread pool saturation: The server cannot keep up with incoming requests. Consider scaling out (add cluster nodes), reducing concurrent clients, or optimizing expensive queries.
Request queuing: Requests are waiting for threads. This is often a symptom of slow queries or transactions. Investigate the root cause rather than simply increasing pool size.
File descriptor exhaustion: Stardog uses file descriptors for database indices, client connections, and internal operations. Hitting the OS limit causes failures that can be difficult to diagnose. See Open Files Setting for configuration guidance.

License Metrics

Metric	What to Watch	Alert Threshold	Severity
`dbms.license.expiration`	Days until license expiration	< 30 days	Medium
`dbms.license.expiration`	Days until license expiration	< 7 days	Urgent

Remediation

License expiration: Contact Stardog support to renew your license. An expired license will prevent the server from starting.

Summary: Essential Dashboard

For a quick-start monitoring dashboard, prioritize these metrics:

Category	Key Metric	Why
Process memory	`dbms.memory.system.usageRatio`	Detect OOM risk
Disk space	`dbms.home.free.space`	Prevent disk-full conditions
Query performance	`databases.YourDb.queries.latency.p95`	Track user-facing performance
Query throughput	`databases.YourDb.queries.latency.m1_rate`	Track request volume
Write performance	`databases.YourDb.txns.latency.p95`	Track write latency
Thread saturation	`user.threads.queued`	Detect request queuing

Alert thresholds in this document are general recommendations. You should establish baselines specific to your workload and environment, then set thresholds relative to those baselines. What constitutes “normal” varies significantly between deployments.
