Recommended Metrics and Alerts
This page describes the most important metrics to monitor and recommended alert thresholds for production Stardog deployments.
Overview
Stardog exposes hundreds of metrics via its monitoring system. While all of these metrics can be useful for deep-dive troubleshooting, a smaller subset is essential for day-to-day operational monitoring and alerting. This page identifies the most important metrics to track, organized by functional area, along with recommended alert thresholds and remediation steps.
For details on how to access these metrics (HTTP, CLI, Prometheus, JMX), see the Server Monitoring page.
What to Monitor
At a high level, production Stardog deployments should be monitored across these areas:
| Area | Why It Matters |
|---|---|
| Memory | Prevent out-of-memory conditions and excessive GC |
| Disk | Prevent disk-full conditions that cause data loss or corruption |
| Cluster replication | Ensure nodes stay in sync (HA deployments) |
| Query performance | Detect slow queries and throughput regressions before users are impacted |
| Transaction health | Ensure write operations complete in a timely manner |
| Server load | Identify CPU saturation, resource exhaustion, and request queuing |
Memory Metrics
These metrics track JVM and native memory usage.
| Metric | What to Watch | Alert Threshold | Severity |
|---|---|---|---|
| dbms.memory.system.usageRatio | Process memory usage ratio | > 0.9 for > 5 min | Urgent |
| dbms.memory.system.rss | Current RSS (includes all memory used by the process) | Covered by usageRatio above | n/a |
| dbms.memory.system.rss.peak | Peak RSS since process start | Covered by usageRatio above | n/a |
| dbms.memory.native.query.blocks.used | Native query block usage | Sustained value near dbms.memory.native.query.blocks.max | High |
| jvm.gc.<collector>.time | GC pause duration | Long or frequent pauses | High |
dbms.memory.system.rss.peak is often more useful than current RSS for capacity planning, as the current value may not be representative of what the workload requires under peak load.
dbms.memory.native.used will normally sit near 90% of dbms.memory.native.max because this memory is pre-allocated for query blocks and RocksDB caches. High native usage alone is not a concern. Focus on RSS and GC behavior instead.
GC metrics are only available when metrics.jvm.enabled=true in stardog.properties. The <collector> name depends on the JVM’s garbage collector configuration. For example, with G1GC (the default on Java 21) the metrics are jvm.gc.G1-Young-Generation.time and jvm.gc.G1-Old-Generation.time. Each collector also exposes a .count metric for the number of collections.
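Several thresholds on this page are sustained conditions, such as usageRatio > 0.9 for > 5 min. The following is a minimal sketch of how such a condition could be evaluated over collected samples. The function name breached_for and the (timestamp, value) sample format are assumptions made for illustration and are not part of Stardog.

```python
from typing import Iterable, Tuple


def breached_for(samples: Iterable[Tuple[float, float]],
                 threshold: float,
                 window_s: float) -> bool:
    """Return True if the most recent unbroken run of samples above
    `threshold` spans at least `window_s` seconds.

    `samples` is an iterable of (unix_timestamp_seconds, value) pairs.
    """
    ordered = sorted(samples)
    if not ordered:
        return False
    latest_ts = ordered[-1][0]
    run_start = None
    # Walk backwards from the newest sample until the value drops
    # to or below the threshold.
    for ts, value in reversed(ordered):
        if value > threshold:
            run_start = ts
        else:
            break
    return run_start is not None and latest_ts - run_start >= window_s
```

For example, breached_for(samples, 0.9, 300) would express the usageRatio rule, given samples of dbms.memory.system.usageRatio collected at a regular interval. Monitoring systems such as Prometheus provide this kind of check natively, so a helper like this is mainly useful for ad hoc scripts.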
Remediation
- High RSS approaching physical memory: The process is at risk of being killed by the OS OOM killer. This is the most urgent memory alert.
- Query block exhaustion: Queries are competing for limited memory blocks. Increase memory allocation or reduce query concurrency.
- Long or frequent GC pauses: Indicates heap memory pressure. Long GC pauses can also destabilize clusters by causing heartbeat timeouts and node expulsions. Stardog may be under-provisioned for the workload. Increase -Xmx or reduce concurrent load.
Disk Metrics
Stardog exposes disk space metrics directly. Disk I/O utilization and latency are not exposed by Stardog and should be monitored at the infrastructure level (e.g., via node exporter, CloudWatch, or system tools like iostat).
| Metric | What to Watch | Alert Threshold | Severity |
|---|---|---|---|
| dbms.home.free.space | Free space on the $STARDOG_HOME volume | < 20% of dbms.home.total.space | High |
| dbms.home.free.space | Free space on the $STARDOG_HOME volume | < 10% of dbms.home.total.space | Urgent |
| Disk I/O utilization (OS-level) | Storage subsystem saturation | > 90% for > 5 min | High |
| Disk I/O latency (OS-level) | Storage latency | > 20ms average for > 5 min | Medium |
dbms.home.usable.space can be used instead of dbms.home.free.space for a more accurate measurement, as it accounts for OS-level permissions and reserved blocks that may not be available to the Stardog process.
These metrics only cover the $STARDOG_HOME volume. Stardog also writes temporary files to the temp directory (java.io.tmpdir, defaults to /tmp) and the spilling directory (spilling.dir, defaults to $STARDOG_HOME/.spilling). If these are on separate volumes, monitor their disk space independently. See Scratch Space for configuration details.
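The two disk space thresholds in the table above can be expressed as a single classification. This is a minimal sketch: the function name disk_severity is hypothetical, and it assumes you have already read dbms.home.free.space (or dbms.home.usable.space) and dbms.home.total.space in the same unit.

```python
from typing import Optional


def disk_severity(free_bytes: float, total_bytes: float) -> Optional[str]:
    """Map free space on the $STARDOG_HOME volume to an alert severity.

    Returns "Urgent" below 10% free, "High" below 20% free,
    and None when no alert is warranted.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    free_ratio = free_bytes / total_bytes
    if free_ratio < 0.10:
        return "Urgent"
    if free_ratio < 0.20:
        return "High"
    return None
```

The same function could be applied to the temp and spilling volumes if they are mounted separately, using free and total space obtained from the operating system.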
Remediation
- Low disk space: Stardog requires free disk space for compaction, transaction logs, and temporary files. Running out of disk space can cause data corruption. Expand the volume or remove unused databases and backups.
- High disk I/O: Storage engine stalls often correlate with saturated disk I/O. Consider faster storage (SSD/NVMe) or reducing concurrent write load.
Cluster Replication Metrics
These metrics are relevant for High Availability deployments.
In a cluster, each node reports its own metrics independently, so collect and monitor metrics from every node, not just one. A sync or resource issue may only be visible on the affected node. A node that crashes stops reporting metrics entirely, so configure your monitoring system to alert when a node stops responding.
Stardog provides two health endpoints: GET /admin/alive returns 200 if the node process is running (even if it is still syncing and has not yet joined the cluster), while GET /admin/healthcheck returns 200 only when the node has fully joined the cluster and is ready to serve requests. Use /admin/healthcheck as a readiness probe and /admin/alive as a liveness probe.
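The two health endpoints can be combined into a per-node state. The sketch below uses only the Python standard library. The helper names probe and node_state, the state labels, and the example base URL are assumptions for illustration. It also assumes the endpoints are reachable from the monitoring host without additional authentication, which may not hold in your environment.

```python
import urllib.error
import urllib.request
from typing import Optional


def probe(base_url: str, path: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status of GET base_url + path, or None if unreachable."""
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses are raised as HTTPError; keep the status code.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return None


def node_state(alive_status: Optional[int], health_status: Optional[int]) -> str:
    """Classify a node from its /admin/alive and /admin/healthcheck statuses."""
    if health_status == 200:
        return "ready"            # joined the cluster and serving requests
    if alive_status == 200:
        return "alive-not-ready"  # process is up, e.g. still syncing
    return "down"                 # no successful response: alert on this
```

A usage example, with a hypothetical node address:

```python
base = "http://stardog-node-1:5820"  # hypothetical host and port
state = node_state(probe(base, "/admin/alive"), probe(base, "/admin/healthcheck"))
```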
Under normal conditions, sync events should not occur. A sync is triggered when a node gets expelled from the cluster, crashes, or restarts and has missed transactions. The node will first attempt a partial sync (faster, replays missed transactions from the log) and fall back to a full sync (much slower, copies the entire database) if the partial sync fails.
| Metric | What to Watch | Alert Threshold | Severity |
|---|---|---|---|
| cluster.sync.attempt.count | Partial sync attempts | Any increase outside of planned restarts | Medium |
| cluster.sync.failure.count | Partial sync failures | Any increase | High |
| cluster.fullsync.attempts | Full sync attempts | Any increase | High |
| cluster.fullsync.failures | Full sync failures | Any increase | Urgent |
cluster.fullsync.attempts is not updated in versions prior to 12.0.1.
Remediation
- Frequent sync attempts: Syncs should only occur after node restarts or expulsions. Frequent syncs indicate cluster instability, which can be caused by resource exhaustion (memory, CPU, disk), network connectivity issues, or GC pauses causing heartbeat timeouts.
- Partial sync failures: A partial sync fails when the transaction log on the source node no longer contains the transactions the recovering node missed. This typically means the transaction log size is too small for the cluster’s workload. Increase the transaction log size to retain more history.
- Full sync attempts: A full sync means a partial sync was not possible. Full syncs are slower because they copy the entire database. Investigate why partial syncs are failing (usually insufficient transaction log size).
- Full sync failures: A full sync failed to complete. Check disk space on both source and target nodes, network bandwidth between nodes, and server logs for error details.
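Because the sync metrics alert on any increase rather than on an absolute value, a check needs to compare two consecutive readings. The following is a minimal sketch under that assumption. The function name sync_alerts and the dict-of-readings input are hypothetical, and the sketch does not suppress alerts during planned restarts.

```python
from typing import Dict, List, Tuple

# Severity per counter, taken from the table above.
SYNC_COUNTERS = {
    "cluster.sync.attempt.count": "Medium",
    "cluster.sync.failure.count": "High",
    "cluster.fullsync.attempts": "High",
    "cluster.fullsync.failures": "Urgent",
}


def sync_alerts(previous: Dict[str, float],
                current: Dict[str, float]) -> List[Tuple[str, str, float]]:
    """Return (metric, severity, increase) for every sync counter that grew
    between two readings taken from the same node."""
    alerts = []
    for name, severity in SYNC_COUNTERS.items():
        before = previous.get(name, 0)
        after = current.get(name, 0)
        # A lower value is treated as a counter reset, not as an alert.
        if after > before:
            alerts.append((name, severity, after - before))
    return alerts
```

Readings should be compared per node, since each node reports its own counters.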
Query Performance Metrics
These metrics track how queries are performing across your databases. Replace YourDb with the actual database name.
| Metric | What to Watch | Alert Threshold | Severity |
|---|---|---|---|
| databases.YourDb.queries.latency.p99 | 99th percentile query time | > 2x your baseline for > 5 min | High |
| databases.YourDb.queries.latency.p95 | 95th percentile query time | > 2x your baseline for > 5 min | Medium |
| databases.YourDb.queries.latency.mean | Average query time | Significant sustained increase | Medium |
| databases.YourDb.queries.running | Currently executing queries | Sustained value near thread pool size | High |
| databases.YourDb.queries.memory.spilled | Bytes spilled to disk | Non-zero and increasing rapidly | Medium |
| databases.YourDb.planCache.ratio | Query plan cache hit rate | < 0.5 for > 15 min | Low |
Remediation
- High latency: Check the query log for slow queries. Use explain to analyze query plans. Consider adding or updating statistics.
- Low throughput: Investigate thread pool saturation (see Server Load Metrics) or disk I/O bottlenecks (see Disk Metrics).
- Memory spilling: The server is running out of in-memory query blocks and spilling to disk. Consider increasing memory allocation. See Memory Management.
- Low plan cache ratio: Many unique queries are being issued. Parameterize queries where possible to improve cache reuse.
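The latency thresholds above are relative to a baseline that this page leaves for you to define. One possible approach, shown here as a sketch, is to take the median of historical readings of the same percentile. The function name latency_regressed and the choice of the median are assumptions, not a Stardog recommendation.

```python
import statistics
from typing import Sequence


def latency_regressed(history: Sequence[float],
                      current: float,
                      factor: float = 2.0) -> bool:
    """Return True if `current` exceeds `factor` times the baseline,
    where the baseline is the median of `history`.

    `history` holds earlier readings of the same metric, for example
    databases.YourDb.queries.latency.p99 sampled during normal operation.
    """
    if not history:
        raise ValueError("history must contain at least one reading")
    baseline = statistics.median(history)
    return current > factor * baseline
```

To match the "for > 5 min" part of the threshold, the result would still need to hold across consecutive evaluations before an alert is raised.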
Transaction Metrics
These metrics track write transaction performance.
| Metric | What to Watch | Alert Threshold | Severity |
|---|---|---|---|
| databases.YourDb.txns.latency.p99 | 99th percentile transaction time | > 2x your baseline for > 5 min | High |
| databases.YourDb.txns.latency.mean | Average transaction latency | Significant sustained increase | Medium |
| databases.YourDb.txns.openTransactions | Currently open transactions | Sustained value > expected concurrency | Medium |
| databases.YourDb.txns.size.max | Largest recent transaction size | Unexpectedly large values | Low |
Remediation
- High transaction latency: Check for large transactions, compaction backlogs, or lock contention. Smaller, more frequent transactions generally perform better.
- Many open transactions: Long-running transactions hold resources. Investigate whether clients are failing to commit or roll back.
Server Load Metrics
These metrics track CPU usage, request handling capacity, and thread pool health.
| Metric | What to Watch | Alert Threshold | Severity |
|---|---|---|---|
| system.cpu.usage | CPU usage ratio | > 0.90 for > 5 min | High |
| user.threads.active | Active user-facing threads | Sustained value near user.threads.size | High |
| user.threads.queued | Queued user requests | Sustained increase over time | High |
| admin.threads.active | Active admin threads | Sustained value near admin.threads.size | Medium |
| admin.threads.queued | Queued admin requests | Sustained increase over time | Medium |
| com.stardog.http.server-PORT.avgRequesttime.p99 | 99th percentile HTTP request time | > 2x your baseline for > 5 min | Medium |
| com.stardog.http.server-PORT.currentRequests | In-flight HTTP requests | Sustained value near thread pool size | High |
| jvm.os.file-descriptors.open | Open file descriptors | > 80% of jvm.os.file-descriptors.max | High |
File descriptor metrics are only available when metrics.jvm.enabled=true in stardog.properties.
Remediation
- High CPU usage: Identify expensive queries via the query log. Consider optimizing queries, reducing concurrency, or scaling out.
- Thread pool saturation: The server cannot keep up with incoming requests. Consider scaling out (add cluster nodes), reducing concurrent clients, or optimizing expensive queries.
- Request queuing: Requests are waiting for threads. This is often a symptom of slow queries or transactions. Investigate the root cause rather than simply increasing pool size.
- File descriptor exhaustion: Stardog uses file descriptors for database indices, client connections, and internal operations. Hitting the OS limit causes failures that can be difficult to diagnose. See Open Files Setting for configuration guidance.
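Thread pool and file descriptor thresholds are both ratios of a usage metric to its limit, so they can share one check. The sketch below evaluates a single snapshot of metrics. The function name load_alerts, the dict input, and the 0.9 cutoff used to interpret "near" the pool size are all assumptions. Only the 80% file descriptor threshold comes from the table above.

```python
from typing import Dict, List, Tuple


def load_alerts(metrics: Dict[str, float],
                near: float = 0.9) -> List[Tuple[str, str]]:
    """Return (metric, severity) for each usage metric whose ratio to its
    limit exceeds the configured cutoff in this snapshot."""
    checks = [
        # (usage metric, limit metric, ratio cutoff, severity)
        ("user.threads.active", "user.threads.size", near, "High"),
        ("admin.threads.active", "admin.threads.size", near, "Medium"),
        ("jvm.os.file-descriptors.open", "jvm.os.file-descriptors.max", 0.80, "High"),
    ]
    alerts = []
    for used_key, limit_key, cutoff, severity in checks:
        used = metrics.get(used_key)
        limit = metrics.get(limit_key)
        if used is None or not limit:
            # Metric not reported, e.g. metrics.jvm.enabled is not set.
            continue
        if used / limit > cutoff:
            alerts.append((used_key, severity))
    return alerts
```

Because the table asks for sustained values, a single snapshot exceeding the cutoff is a signal to keep watching rather than an alert on its own.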
License Metrics
| Metric | What to Watch | Alert Threshold | Severity |
|---|---|---|---|
| dbms.license.expiration | Days until license expiration | < 30 days | Medium |
| dbms.license.expiration | Days until license expiration | < 7 days | Urgent |
Remediation
- License expiration: Contact Stardog support to renew your license. An expired license will prevent the server from starting.
Summary: Essential Dashboard
For a quick-start monitoring dashboard, prioritize these metrics:
| Category | Key Metric | Why |
|---|---|---|
| Process memory | dbms.memory.system.usageRatio | Detect OOM risk |
| Disk space | dbms.home.free.space | Prevent disk-full conditions |
| Query performance | databases.YourDb.queries.latency.p95 | Track user-facing performance |
| Query throughput | databases.YourDb.queries.latency.m1_rate | Track request volume |
| Write performance | databases.YourDb.txns.latency.p95 | Track write latency |
| Thread saturation | user.threads.queued | Detect request queuing |
Alert thresholds in this document are general recommendations. You should establish baselines specific to your workload and environment, then set thresholds relative to those baselines. What constitutes “normal” varies significantly between deployments.