This page describes helpful information for troubleshooting a Stardog Cluster.
This guide includes information to debug issues with a Stardog Cluster. For this guide to be helpful it’s important to familiarize yourself with the cluster architecture and guarantees. For the purpose of keeping this section focused on the cluster, it does not provide an in depth discussion of troubleshooting many of the kinds of issues that can go wrong on a single Stardog server, which can also be problematic on a node in a cluster.
The first thing to do when the cluster is experiencing trouble is check a few basic items in order to get your bearings on the situation. This involves examining cluster membership to see if any nodes have been expelled or dropped, inspecting logs and metrics on each node, monitoring resource usage of each node, and checking for any common problems with timeouts that may be occurring.
To check if any nodes have been expelled or dropped from the cluster use the
cluster info command:
$ stardog-admin cluster info Coordinator: 127.0.0.1:6000 Nodes: 127.0.0.1:6002 127.0.0.1:6001
cluster status command will also which nodes are currently members along with a brief overview of the server.
If any nodes are no longer members the first place to look is the end of the Stardog log in
$STARDOG_HOME/stardog.log on the node that is no longer in the cluster. That will often provide some clues for the reason the node is no longer a member.
The other log to check is the coordinator’s Stardog log to see if there are any messages about the node being expelled around the time in question. Those messages typically look like (in this case the node failed to drop the database so it’s expelled):
WARN 2021-04-26 18:06:11,050 [stardog-admin-4] com.complexible.stardog.pack.replication.tx.Replicate:expelFailedNodes(444): Action drop: Initiating node 10.244.11.25:5820: Failed node(s): [10.244.9.25:5820]
If no nodes report expelling failed nodes (similar to the log message above), then another message to look out for is one where a node drops from ZooKeeper. Those messages typically look like:
INFO 2021-05-03 10:21:00,019 [Curator-ConnectionStateManager-0] com.complexible.stardog.pack.replication.impl.zookeeper.ZkCluster:lambda$listenForConnectionEvents$3(516): Suspended (10.0.1.13:5820) INFO 2021-05-03 10:21:00,019 [zkClusterEvent-3] com.complexible.stardog.pack.virtual.ReplicatedVirtualGraphRegistry:membershipStateChanges(638): Ignoring state change from ACTIVE to SUSPENDED
Nodes that are expelled by the coordinator will also show a similar log message when they transition from ACTIVE to SUSPENDED. If they are expelled there will be a message around the same time in the coordinator’s log. If not, then they lost the connection to ZooKeeper for another reason.
Often nodes are suspended due to connectivity issues with ZooKeeper because of resource problems. Either network connectivity or an overloaded CPU which is unable to schedule Curator threads frequently enough to maintain the heartbeats with ZooKeeper.
There are many different reasons a node may be expelled and not all of them are an indication that the cluster is broken, even if an individual node has trouble. The primary reason a node is expelled is to ensure the cluster remains consistent. As long as it doesn’t happen too frequently and nodes are able to rejoin, then the cluster may be behaving as expected. However, often times when a node is expelled manual intervention is required to remedy the situation so the node can recover (or be replaced) and rejoin.
One common cause of cluster nodes being unable to join the cluster is an open transaction. Often this is caused by legitimate user workloads. Writes to the cluster will typically block the node from joining until writes subside. The database setting
database.connection.timeout will also terminate any idle open transactions, after which nodes will be able to join.
If the writes are legitimate and you’d like to stop the writes and allow the node to join. You can either offline the database with open transactions or put the cluster into read only mode. After the queries terminate the node should be able to join.
Offlining a database will terminate all read and write queries for that database.
However, if you believe there are no writes to the cluster and the node is still unable to join then you should inspect the databases for idle open transactions.
You can see which databases have open transactions with:
stardog-admin cluster status
And you can list open transactions on a database with:
stardog tx list <database>
You should use a superuser to list open transactions on a database otherwise you will only see open transactions that the current user has permission to see.
If you find an idle transaction, you can roll it back with:
stardog tx rollback <database> <txId>
After the transaction has been removed, the node should be able to join.
The rollback operation may not complete immediately if the transaction is in the middle of a long running operation.
If you have followed all of these steps and the node is still unable to obtain the lock to join, first double check that all writes to the cluster have been stopped and that there are no new transactions to any databases. After you confirm that there are no writes to the cluster, you can inspect the transactions in ZooKeeper with:
stardog-admin zk info
If ZooKeeper shows a list of admin actions or transactions but Stardog does not report any open transactions and a user is not running an admin action, you’ll need to remove the locks from ZooKeeper before the node can join. Restarting any nodes that hold the locks for admin actions or transactions should clear them from ZooKeeper and allow the node to join. The output will show the owner of each lock, which will correspond to a node in the members list of
zk info. Please report this to support so Stardog Engineering can investigate. Provide as much information as you can from the steps above, including detailed examples of workloads and admin operations you’ve run and when, as well as the information outlined in the cluster section.
You can gather all of the metrics for the cluster with the
cluster metrics command. This will be metrics for all nodes in the cluster. The start of each node’s metrics will be proceeded by its address:
Node : 127.0.0.1:6001
The metrics that begin with
cluster are the primary metrics to check for cluster-specific issues and not just individual Stardog issues for nodes in the cluster (a topic beyond the scope of this cluster troubleshooting guide, please see the Stardog documentation for further discussion of the other metrics):
cluster.fullsync.attempts: 1 cluster.fullsync.failures: 0 cluster.fullsync.lastFailed.timestamp: 0 cluster.fullsync.lastSuccessful.timestamp: 1,619,637,875,761 cluster.fullsync.success.count: 1 cluster.sync.attempt.count: 1 cluster.sync.check.count : 1 cluster.sync.failure.count: 0 cluster.sync.lastFailed.timestamp: 0 cluster.sync.lastSuccessful.timestamp: 0 cluster.sync.running : 0 cluster.sync.success.count: 1
This can provide a good overview if a node has been dropping and having to rejoin the cluster which can be a sign of trouble and may merit further investigation.
One common problem with nodes in the cluster can be related to resource usage. For example, because all nodes in the cluster have a copy of the data, any node can service a query. While we work hard to prevent queries from causing issues, it’s possible for a node to run a bad query which utilizes too many resources, causing the node to cease function properly, e.g., by losing its connection to ZooKeeper and dropping from the cluster. In this case the node may require manual intervention and have to be cleaned up, recovered, or replaced and then rejoin the cluster. If a Stardog node is experiencing memory pressure (from Stardog or other processes on the node) you will typically see memory warnings in its log once the free memory on the node drops below 10%:
INFO 2021-04-25 00:53:06,233 [memory-monitor] com.complexible.stardog.api.NativeMemoryMonitor:reportStatus(127): Memory usage 90% - initiating detailed analysis - will check again in 1m INFO 2021-04-25 00:53:06,236 [memory-monitor] com.complexible.stardog.api.NativeMemoryMonitor:reportStatus(128): Stardog JVM memory usage
Stardog does not log similar warnings for high CPU use or a full disk, so it’s important to monitor those resources as well and inspect them if a node is having trouble.
Other common issues to look out are for any timeouts, especially when performing any large data loads or executing any long running queries. It’s possible you may need to adjust various timeout settings or change certain aspects of the operation you’re attempting in order to complete it before it times out. Support can help you determine which setting may need changing depending on the nature of the timeout.
Because Stardog HTTP clients are blocking there are cases where long running operations, such as a large bulk load or copying a large virtual graph to a named graph using SPARQL
COPY, may timeout on the client but continue running on the server when connecting through a load balancer. This typically happens because the load balancer idle timeout is set too low and the operation runs for longer than the load balancer allows. When this occurs the load balancer severs the connection to the client. You can sometimes workaround this by increasing the idle timeout on the load balancer. However, if the idle timeout is already at the max then the only course of action is to monitor progress in
$STARDOG_HOME/stardog.log until the operation completes on the cluster. At that point the operation should be successful and the data should be available even though the client receives a timeout. Depending on the load balancer this may manifest in different ways but often you’ll receive a
GATEWAY_TIMEOUT error message from the client in the middle of the operation.
Managing Stardog Cluster in Kubernetes (k8s) comes with benefits as well as pitfalls that can be surprising to admins who are less familiar with k8s.
We recommend that you deploy and manage Stardog in k8s with our open source helm charts.
Stardog Cluster uses persistent volumes in k8s to maintain state, both for Stardog itself as well as ZooKeeper.
Deleting persistent volume claims for either Stardog or ZooKeeper may result in data loss. Only do so with extreme caution and after verifying that you have complete backups located outside of the k8s cluster.
K8s uses readiness probes to determine which nodes it should route traffic to and liveness probes to determine if a node has failed and should be restarted.
Stardog provides an HTTP readiness endpoint at
/admin/healthcheck which will return 200 if a node is currently a cluster member that has fully joined. The HTTP liveness endpoint is
/admin/alive and denotes whether or not the node is alive, even if it hasn’t yet joined the cluster. A node may return true for an alive probe when it is syncing data to join the cluster and false for the readiness probe because it hasn’t yet sync’d all of the data and completed the join.
If Stardog is unable to respond to the liveness probe k8s may restart the node after a few failures. In general this should be fine as a node may have died for any number of reasons that a restart can help fix. However, if the liveness settings are set too low and require too few failures before restarting, k8s may restart cluster nodes when it is not needed and cause issues with the cluster. If the timeouts are too low and a node is simply heavily loaded due to an expensive data load or query, it’s possible that Stardog may not be able to respond to the liveness check in time and k8s will restart the node. Once this happens it can cause additional issues in the cluster since the node may introduce more pressure on the other nodes as it attempts to sync and join.
It’s important to make sure your liveness probe settings are high enough for your workload to prevent unnecessary restarts by k8s. These can be adjusted in the helm chart
values.yaml file as follows:
livenessProbe: initialDelaySeconds: 30 periodSeconds: 30 timeoutSeconds: 15
It’s important to always shutdown Stardog servers cleanly in order to avoid data corruption. In VMs or bare metal environments this typically isn’t an issue because the shutdown command can be issued and wait as long as needed for Stardog to wrap up and shutdown.
By default k8s will wait 30 seconds for a graceful shutdown before forcefully terminating the pod. In some circumstances this can cause data corruption if k8s issues a SIGKILL before Stardog has completed its shutdown. In general it is best practice to restart Stardog in k8s by issuing shutdown to the Stardog server directly with
server stop or to all cluster nodes with
cluster stop instead of deleting the Stardog pods with
kubectl or helm commands. Once Stardog stops, k8s will recognize this and start it again.
In cases where you need to scale the cluster down and completely stop Stardog pods, make sure no queries are running, stop all write traffic to the cluster (if possible), and increase the graceful shutdown period in k8s to a sufficiently high value to prevent data corruption.
When experiencing issues with a cluster in k8s it’s important to gather all of the Stardog specific information outlined in the getting support section as well as information from k8s.
An overview of the Stardog and ZooKeeper pods which will identify the pods that have restarted and how many times and if they are currently passing the readiness check:
kubectl -n <namespace> get pods
Describe each of the Stardog and ZooKeeper pods, especially any that are failing the readiness check or have restarted:
kubectl -n <namespace> describe pod <pod name>
Gather logs for all pods:
kubectl -n <namespace> logs <pod name>
Gather events for all resources in the namespace:
kubectl -n <namesapce> get events --sort-by=.metadata.creationTimestamp
To help you gather information or inspect inside one of the pods, you can exec into it:
kubectl exec -n <namespace> -it <pod name> -- /bin/bash
More detailed information about helpful k8s commands for debugging can be found in the k8s docs.