Troubleshooting Stardog Cluster
This page provides information to help you troubleshoot a Stardog Cluster.
Overview
This guide covers how to debug issues with a Stardog Cluster. To get the most out of it, first familiarize yourself with the cluster architecture and guarantees. To keep the focus on the cluster, this guide does not discuss in depth the many kinds of issues that can occur on a single Stardog server, even though those issues can also affect a node in a cluster.
Stardog Cluster Inspection
The first thing to do when the cluster is experiencing trouble is check a few basic items in order to get your bearings on the situation. This involves examining cluster membership to see if any nodes have been expelled or dropped, inspecting logs and metrics on each node, monitoring resource usage of each node, and checking for any common problems with timeouts that may be occurring.
To check if any nodes have been expelled or dropped from the cluster, use the cluster info command:
$ stardog-admin cluster info
Coordinator:
127.0.0.1:6000
Nodes:
127.0.0.1:6002
127.0.0.1:6001
The cluster status command will also show which nodes are currently members, along with a brief overview of the server.
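If you know which nodes should be in the cluster, the membership check can be scripted. The following is a minimal sketch, assuming the cluster info output has the layout shown above (one address per line). The missing_nodes function is a hypothetical helper, not part of Stardog.

```shell
# missing_nodes: hypothetical helper, not part of Stardog.
# Reads `stardog-admin cluster info` output on stdin and prints every
# expected address (passed as arguments) that does not appear in it.
missing_nodes() {
  info=$(cat)
  for node in "$@"; do
    # Trim surrounding whitespace, then look for an exact whole-line match
    # (-F: fixed string, -x: whole line, -q: quiet).
    if ! printf '%s\n' "$info" \
        | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
        | grep -Fxq "$node"; then
      printf '%s\n' "$node"
    fi
  done
}
```

For example, stardog-admin cluster info | missing_nodes 127.0.0.1:6000 127.0.0.1:6001 127.0.0.1:6002 prints nothing when all three nodes are members.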
Logs
If any nodes are no longer members, the first place to look is the end of the Stardog log in $STARDOG_HOME/stardog.log on the node that is no longer in the cluster. That will often provide some clues about why the node is no longer a member.
The other log to check is the coordinator’s Stardog log, to see if there are any messages about the node being expelled around the time in question. Those messages typically look like the following (in this case the node failed to drop the database, so it was expelled):
WARN 2021-04-26 18:06:11,050 [stardog-admin-4] com.complexible.stardog.pack.replication.tx.Replicate:expelFailedNodes(444): Action drop: Initiating node 10.244.11.25:5820: Failed node(s): [10.244.9.25:5820]
If no nodes report expelling failed nodes (similar to the log message above), then another message to look out for is one where a node drops from ZooKeeper. Those messages typically look like:
INFO 2021-05-03 10:21:00,019 [Curator-ConnectionStateManager-0] com.complexible.stardog.pack.replication.impl.zookeeper.ZkCluster:lambda$listenForConnectionEvents$3(516): Suspended (10.0.1.13:5820)
INFO 2021-05-03 10:21:00,019 [zkClusterEvent-3] com.complexible.stardog.pack.virtual.ReplicatedVirtualGraphRegistry:membershipStateChanges(638): Ignoring state change from ACTIVE to SUSPENDED
Nodes that are expelled by the coordinator will also show a similar log message when they transition from ACTIVE to SUSPENDED. If they are expelled there will be a message around the same time in the coordinator’s log. If not, then they lost the connection to ZooKeeper for another reason.
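To find these messages quickly in a long log, a simple filter helps. This sketch assumes the messages match the sample lines above. The patterns are taken from those samples and may not cover every Stardog version, and membership_events is a hypothetical name.

```shell
# membership_events: hypothetical helper that filters a Stardog log (stdin)
# down to the expel and suspend messages shown in the samples above.
membership_events() {
  # \( is a literal parenthesis in an extended regular expression.
  # `|| true` keeps the exit status at 0 when nothing matches.
  grep -E 'expelFailedNodes|Suspended \(|from ACTIVE to SUSPENDED' || true
}
```

Running membership_events < "$STARDOG_HOME/stardog.log" on both the coordinator and the affected node lets you compare timestamps and tell an expulsion apart from a lost ZooKeeper connection.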
Often nodes are suspended due to connectivity issues with ZooKeeper that are caused by resource problems: either poor network connectivity or an overloaded CPU that is unable to schedule Curator threads frequently enough to maintain the heartbeats with ZooKeeper.
There are many different reasons a node may be expelled, and not all of them indicate that the cluster is broken, even if an individual node has trouble. The primary reason a node is expelled is to ensure the cluster remains consistent. As long as it doesn’t happen too frequently and nodes are able to rejoin, the cluster may be behaving as expected. However, when a node is expelled, manual intervention is often required to remedy the situation so the node can recover (or be replaced) and rejoin.
Metrics
You can gather all of the metrics for the cluster with the cluster metrics command. The output contains the metrics for all nodes in the cluster. The start of each node’s metrics is preceded by its address:
Node : 127.0.0.1:6001
The metrics that begin with cluster are the primary metrics to check for cluster-specific issues, as opposed to issues with individual Stardog nodes (a topic beyond the scope of this cluster troubleshooting guide; please see the Stardog documentation for further discussion of the other metrics):
cluster.fullsync.attempts: 1
cluster.fullsync.failures: 0
cluster.fullsync.lastFailed.timestamp: 0
cluster.fullsync.lastSuccessful.timestamp: 1,619,637,875,761
cluster.fullsync.success.count: 1
cluster.sync.attempt.count: 1
cluster.sync.check.count : 1
cluster.sync.failure.count: 0
cluster.sync.lastFailed.timestamp: 0
cluster.sync.lastSuccessful.timestamp: 0
cluster.sync.running : 0
cluster.sync.success.count: 1
These metrics give a good overview of whether a node has been dropping from and rejoining the cluster, which can be a sign of trouble and may merit further investigation.
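The failure counters can be pulled out of the metrics output per node. The sketch below assumes the layout shown above: a Node : <address> header followed by name: value lines, with values that may contain thousands separators. The sync_failures function is a hypothetical helper.

```shell
# sync_failures: hypothetical helper. Reads cluster metrics output on stdin
# and prints "<node> <metric> <value>" for every cluster failure counter
# that is greater than zero.
sync_failures() {
  awk '
    # Remember which node the following metrics belong to.
    /^Node[[:space:]]*:/ {
      sub(/^Node[[:space:]]*:[[:space:]]*/, "")
      node = $0
      next
    }
    # Failure counters such as cluster.fullsync.failures
    # and cluster.sync.failure.count.
    /^cluster\..*failure/ {
      n = index($0, ":")                 # split on the first colon
      name = substr($0, 1, n - 1)
      value = substr($0, n + 1)
      gsub(/[[:space:]]/, "", name)
      gsub(/[[:space:],]/, "", value)    # drop spaces and thousands separators
      if (value + 0 > 0) print node, name, value
    }
  '
}
```

An empty result means no node has recorded a failed sync or full sync. Any output points at the node whose log is worth reading first.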
Resource Usage
One common problem with nodes in the cluster is related to resource usage. For example, because all nodes in the cluster have a copy of the data, any node can service a query. While we work hard to prevent queries from causing issues, it’s possible for a node to run a bad query that uses too many resources, causing the node to cease functioning properly, e.g., by losing its connection to ZooKeeper and dropping from the cluster. In this case the node may require manual intervention and have to be cleaned up, recovered, or replaced before it can rejoin the cluster.
If a Stardog node is experiencing memory pressure (from Stardog or other processes on the node), you will typically see memory warnings in its log once the free memory on the node drops below 10%:
INFO 2021-04-25 00:53:06,233 [memory-monitor] com.complexible.stardog.api.NativeMemoryMonitor:reportStatus(127): Memory usage 90% - initiating detailed analysis - will check again in 1m
INFO 2021-04-25 00:53:06,236 [memory-monitor] com.complexible.stardog.api.NativeMemoryMonitor:reportStatus(128): Stardog JVM memory usage
Stardog does not log similar warnings for high CPU use or a full disk, so it’s important to monitor those resources as well and inspect them if a node is having trouble.
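Since disk usage is not reported in the log, a small check like the following can be run on each node or wired into existing monitoring. This is a sketch using the standard df utility. Both function names are hypothetical, and the default threshold of 90 is an arbitrary example value, not a Stardog recommendation.

```shell
# disk_used_pct: hypothetical helper that prints how full the filesystem
# holding the given directory is, as an integer percentage.
disk_used_pct() {
  # -P forces the POSIX one-line-per-filesystem layout; column 5 is capacity.
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# disk_alert: prints a warning when usage is at or above a threshold.
# Arguments: directory [threshold percentage, default 90]
disk_alert() {
  pct=$(disk_used_pct "$1")
  if [ "$pct" -ge "${2:-90}" ]; then
    printf 'WARNING: %s is %s%% full\n' "$1" "$pct"
  fi
}
```

For example, disk_alert "$STARDOG_HOME" 85 prints a warning once the filesystem holding the Stardog home directory is 85% full or more. CPU load can be watched in a similar way with tools such as uptime or top.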
Timeouts
Other common issues to look out for are timeouts, especially when performing large data loads or executing long-running queries. You may need to adjust various timeout settings or change certain aspects of the operation you’re attempting so that it completes before it times out. Support can help you determine which setting may need changing depending on the nature of the timeout.
Because Stardog HTTP clients are blocking, long-running operations, such as a large bulk load or copying a large virtual graph to a named graph using SPARQL COPY, may time out on the client but continue running on the server when connecting through a load balancer. This typically happens because the load balancer idle timeout is set too low and the operation runs for longer than the load balancer allows. When this occurs the load balancer severs the connection to the client. You can sometimes work around this by increasing the idle timeout on the load balancer. However, if the idle timeout is already at its maximum, the only course of action is to monitor progress in $STARDOG_HOME/stardog.log until the operation completes on the cluster. At that point the operation should be successful and the data should be available even though the client received a timeout. Depending on the load balancer this may manifest in different ways, but often you’ll receive a GATEWAY_TIMEOUT error message from the client in the middle of the operation.
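When scripting long-running requests, it helps to treat a gateway timeout differently from an ordinary failure, so that the script does not blindly resubmit an operation that is still running on the server. The sketch below is illustrative only: interpret_status is a hypothetical helper, and it assumes the load balancer reports the idle timeout as HTTP 504, which, as noted above, is common but not universal.

```shell
# interpret_status: hypothetical helper that maps the HTTP status code of a
# long-running request to a suggested next step.
interpret_status() {
  case "$1" in
    2??) echo "done" ;;
    # Gateway timeout: the operation may still be running on the server,
    # so check stardog.log before retrying.
    504) echo "check-server-log" ;;
    *)   echo "failed" ;;
  esac
}
```

The status code can be captured with curl’s -w '%{http_code}' option and passed to interpret_status. A result of check-server-log means the request should not be retried until the log shows whether the original operation completed.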
Stardog Cluster in Kubernetes
Managing Stardog Cluster in Kubernetes (k8s) comes with benefits as well as pitfalls that can be surprising to admins who are less familiar with k8s.
We recommend that you deploy and manage Stardog in k8s with our open source helm charts.
Readiness and Liveness Probes
K8s uses readiness probes to determine which nodes it should route traffic to and liveness probes to determine if a node has failed and should be restarted.
Stardog provides an HTTP readiness endpoint at /admin/healthcheck, which returns 200 if a node is currently a cluster member that has fully joined. The HTTP liveness endpoint is /admin/alive and denotes whether or not the node is alive, even if it hasn’t yet joined the cluster. A node that is syncing data to join the cluster may pass the liveness probe but fail the readiness probe because it hasn’t yet synced all of the data and completed the join.
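Both endpoints can also be queried by hand when debugging. The following sketch combines them into a one-word summary per node. The function names are hypothetical, and the code assumes that /admin/alive, like /admin/healthcheck, signals success with HTTP 200 and that neither endpoint requires credentials in your deployment.

```shell
# probe_status: hypothetical helper that prints the HTTP status code
# returned by an endpoint on a node.
# Arguments: base URL (for example http://10.0.1.13:5820), endpoint path
probe_status() {
  # -s: silent, -o /dev/null: discard the body, -w: print only the status code
  curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$1$2"
}

# node_state: combines both probes into a one-word summary.
node_state() {
  if [ "$(probe_status "$1" /admin/healthcheck)" = "200" ]; then
    echo "ready"
  elif [ "$(probe_status "$1" /admin/alive)" = "200" ]; then
    echo "alive-not-joined"
  else
    echo "unresponsive"
  fi
}
```

A node that reports alive-not-joined for a long time is usually still syncing. Its log and the cluster sync metrics show how far along it is.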
If Stardog is unable to respond to the liveness probe k8s may restart the node after a few failures. In general this should be fine as a node may have died for any number of reasons that a restart can help fix. However, if the liveness settings are set too low and require too few failures before restarting, k8s may restart cluster nodes when it is not needed and cause issues with the cluster. If the timeouts are too low and a node is simply heavily loaded due to an expensive data load or query, it’s possible that Stardog may not be able to respond to the liveness check in time and k8s will restart the node. Once this happens it can cause additional issues in the cluster since the node may introduce more pressure on the other nodes as it attempts to sync and join.
It’s important to make sure your liveness probe settings are high enough for your workload to prevent unnecessary restarts by k8s. These can be adjusted in the helm chart values.yaml file as follows:
livenessProbe:
  initialDelaySeconds: 30
  periodSeconds: 30
  timeoutSeconds: 15
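To judge whether these values are high enough, it helps to estimate how long a node can be unresponsive before k8s restarts it. The sketch below computes a rough lower bound. It relies on failureThreshold, a standard k8s probe field that is not shown in the snippet above and defaults to 3. The restart_window function is a hypothetical helper and the formula is an approximation, not an exact model of the kubelet.

```shell
# restart_window: rough lower bound, in seconds, on how long a node must be
# unresponsive before k8s restarts it: the last of failureThreshold
# consecutive probes starts (threshold - 1) periods after the first one
# and is declared failed once it times out.
# Arguments: periodSeconds timeoutSeconds [failureThreshold, k8s default 3]
restart_window() {
  period=$1
  timeout=$2
  threshold=${3:-3}
  echo $(( ($threshold - 1) * $period + $timeout ))
}
```

With the values above, restart_window 30 15 gives 75, so any data load or query that can leave a node unresponsive for more than about 75 seconds risks triggering a restart.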
Clean Shutdown
It’s important to always shut down Stardog servers cleanly in order to avoid data corruption. In VMs or bare metal environments this typically isn’t an issue because the shutdown command can be issued and can wait as long as needed for Stardog to wrap up and shut down.
By default k8s will wait 30 seconds for a graceful shutdown before forcefully terminating the pod. In some circumstances this can cause data corruption if k8s issues a SIGKILL before Stardog has completed its shutdown. In general it is best practice to restart Stardog in k8s by issuing a shutdown to the Stardog server directly with server stop, or to all cluster nodes with cluster stop, instead of deleting the Stardog pods with kubectl or helm commands. Once Stardog stops, k8s will recognize this and start it again.
In cases where you need to scale the cluster down and completely stop Stardog pods, make sure no queries are running, stop all write traffic to the cluster (if possible), and increase the graceful shutdown period in k8s to a sufficiently high value to prevent data corruption.
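In k8s the graceful shutdown period is controlled by the pod spec field terminationGracePeriodSeconds. One way to raise it is a merge patch against the workload that owns the pods. The sketch below only builds the patch document. It assumes the Stardog pods are managed by a StatefulSet, and grace_period_patch is a hypothetical helper.

```shell
# grace_period_patch: hypothetical helper that prints a JSON merge patch
# setting terminationGracePeriodSeconds in a workload's pod template.
# Argument: grace period in seconds
grace_period_patch() {
  printf '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":%d}}}}\n' "$1"
}
```

The patch can be applied with kubectl -n <namespace> patch statefulset <statefulset name> --type merge -p "$(grace_period_patch 300)", where 300 is only an example value. Changing the pod template normally causes k8s to roll the pods, so apply it at a quiet time, well before the scale-down. If you deploy with the helm charts and they expose an equivalent setting, changing it there keeps the chart and the cluster in agreement.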
Helpful k8s commands for debugging
When experiencing issues with a cluster in k8s, it’s important to gather all of the Stardog-specific information outlined in the getting support section as well as information from k8s.
Get an overview of the Stardog and ZooKeeper pods, which shows which pods have restarted, how many times, and whether they are currently passing the readiness check:
kubectl -n <namespace> get pods
Describe each of the Stardog and ZooKeeper pods, especially any that are failing the readiness check or have restarted:
kubectl -n <namespace> describe pod <pod name>
Gather logs for all pods:
kubectl -n <namespace> logs <pod name>
Gather events for all resources in the namespace:
kubectl -n <namespace> get events --sort-by=.metadata.creationTimestamp
To help you gather information or inspect inside one of the pods, you can exec into it:
kubectl exec -n <namespace> -it <pod name> -- /bin/bash
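When there are many pods, the information-gathering commands above can be collected into one function that writes everything to a directory, ready to attach to a support request. This is a sketch: collect_k8s_diagnostics, the default directory name, and the file names are hypothetical choices.

```shell
# collect_k8s_diagnostics: hypothetical helper that saves pod overviews,
# events, descriptions, and logs for every pod in a namespace.
# Arguments: namespace [output directory, default k8s-diagnostics]
collect_k8s_diagnostics() {
  ns=$1
  out=${2:-k8s-diagnostics}
  mkdir -p "$out"
  kubectl -n "$ns" get pods -o wide > "$out/pods.txt"
  kubectl -n "$ns" get events --sort-by=.metadata.creationTimestamp > "$out/events.txt"
  # `-o name` prints one "pod/<name>" entry per line.
  for pod in $(kubectl -n "$ns" get pods -o name); do
    name=${pod#pod/}
    kubectl -n "$ns" describe "$pod" > "$out/$name.describe.txt"
    kubectl -n "$ns" logs "$pod" --all-containers > "$out/$name.log"
    # Logs from the previous container instance exist only if the pod
    # restarted, so a failure here is ignored.
    kubectl -n "$ns" logs "$pod" --all-containers --previous \
      > "$out/$name.previous.log" 2>/dev/null || true
  done
}
```

The function collects output for every pod in the namespace, so if the namespace contains workloads other than Stardog and ZooKeeper, their output is included as well.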
More detailed information about helpful k8s commands for debugging can be found in the k8s docs.