Troubleshooting Stardog Cluster

This page describes helpful information for troubleshooting a Stardog Cluster.

Page Contents

Overview
Stardog Cluster Inspection
Stardog Cluster in Kubernetes
Known Issues

Overview

This guide includes information to debug issues with a Stardog Cluster. For this guide to be helpful, it’s important to familiarize yourself with the cluster architecture and guarantees. For the purpose of keeping this section focused on the cluster, it does not provide an in-depth discussion of troubleshooting many of the issues that can go wrong on a single Stardog server, which can also be problematic on a node in a cluster.

Stardog Cluster Inspection

The first thing to do when the cluster is experiencing trouble is check a few basic items to get your bearings on the situation. This involves examining cluster membership to see if any nodes have been expelled or dropped, inspecting logs and metrics on each node, monitoring resource usage of each node, and checking for any common problems with timeouts that may be occurring.

To check if any nodes have been expelled or dropped from the cluster, use the cluster info command:

$ stardog-admin cluster info
Coordinator:
    127.0.0.1:6000
Nodes:
    127.0.0.1:6002
    127.0.0.1:6001

The cluster status command will also show which nodes are currently members, along with a brief overview of the server.

Logs

If a node is no longer a member of the cluster, the first place to look is the end of the Stardog log in $STARDOG_HOME/stardog.log on that node. That will often provide clues as to why the node is no longer a member.

The other log to check is the coordinator’s Stardog log. This will help you see if there are any messages about the node being expelled around the time in question. Those messages typically look like:

WARN  2021-04-26 18:06:11,050 [stardog-admin-4] com.complexible.stardog.pack.replication.tx.Replicate:expelFailedNodes(444): Action drop: Initiating node 10.244.11.25:5820: Failed node(s): [10.244.9.25:5820]

In this case, the node failed to drop the database, so it was expelled.

If no nodes report expelling failed nodes (similar to the log message above), another message to look out for is one where a node drops from ZooKeeper. Those messages typically look like:

INFO  2021-05-03 10:21:00,019 [Curator-ConnectionStateManager-0] com.complexible.stardog.pack.replication.impl.zookeeper.ZkCluster:lambda$listenForConnectionEvents$3(516): Suspended (10.0.1.13:5820)
INFO  2021-05-03 10:21:00,019 [zkClusterEvent-3] com.complexible.stardog.pack.virtual.ReplicatedVirtualGraphRegistry:membershipStateChanges(638): Ignoring state change from ACTIVE to SUSPENDED

Nodes that are expelled by the coordinator will also show a similar log message when they transition from ACTIVE to SUSPENDED. If they are expelled, there will be a message around the same time in the coordinator’s log. If not, they lost connection to ZooKeeper for another reason.

Often, nodes are suspended due to connectivity issues with ZooKeeper because of resource problems. This can be network connectivity or an overloaded CPU which is unable to schedule Curator threads frequently enough to maintain the heartbeats with ZooKeeper.

There are many different reasons a node may be expelled, and not all of them are an indication the cluster is broken (even if an individual node has trouble). The primary reason a node is expelled is to ensure the cluster remains consistent. As long as it doesn’t happen too frequently and nodes are able to rejoin, the cluster may be behaving as expected. However, often when a node is expelled, manual intervention is required to remedy the situation so the node can recover (or be replaced) and rejoin.

Open Transactions

One common cause of cluster nodes being unable to join the cluster is an open transaction. Often this is caused by legitimate user workloads. A node will typically be blocked from joining the cluster until writes subside. The database setting database.connection.timeout will also terminate any idle open transactions, after which nodes will be able to join.

If the writes are legitimate but you’d like to stop them and allow the node to join, you can either offline the database with open transactions or put the cluster into read-only mode. After the queries terminate, the node should be able to join.

Offlining a database will terminate all read and write queries for that database.

However, if you believe there are no writes to the cluster and the node is still unable to join, you should inspect your databases for open idle transactions.

You can see which databases have open transactions with:

stardog-admin cluster status

And you can list open transactions on a database with:

stardog tx list <database>

You should use a superuser to list open transactions on a database; otherwise, you will only see the open transactions the current user has permission to see.

If you find an idle transaction, you can roll it back with:

stardog tx rollback <database> <txId>

After the transaction has been removed, the node should be able to join.

The rollback operation may not complete immediately if the transaction is in the middle of a long-running operation.

If you have followed all of these steps and the node is still unable to obtain the lock to join, first double check that all writes to the cluster have been stopped and that there are no new transactions to any databases. After you confirm there are no writes to the cluster, you can inspect the transactions in ZooKeeper with:

stardog-admin zk info

If ZooKeeper shows a list of transactions or admin actions but Stardog does not report any open transactions and a user is not running an admin action, you’ll need to remove the locks from ZooKeeper before the node can join. Restarting any nodes that hold the locks for transactions or admin actions should clear them from ZooKeeper and allow the node to join. The output will show the owner of each lock, which will correspond to a node in the members list of zk info. Please report this to support so Stardog Engineering can investigate. Provide as much information as you can from the steps above, including detailed examples of workloads and admin operations you’ve run and when, as well as the information outlined in the cluster section.

Metrics

You can gather all of the metrics for the cluster with the cluster metrics command. This will display metrics for all nodes in the cluster. The start of each node’s metrics will be proceeded by its address:

Node                     : 127.0.0.1:6001

The metrics that begin with cluster are the primary metrics to check for cluster-specific issues and not just Stardog issues for individual nodes. (The latter is a topic beyond the scope of this cluster troubleshooting guide. Please see Server Monitoring for further discussion of the other metrics):

cluster.fullsync.attempts: 1
cluster.fullsync.failures: 0
cluster.fullsync.lastFailed.timestamp: 0
cluster.fullsync.lastSuccessful.timestamp: 1,619,637,875,761
cluster.fullsync.success.count: 1
cluster.sync.attempt.count: 1
cluster.sync.check.count : 1
cluster.sync.failure.count: 0
cluster.sync.lastFailed.timestamp: 0
cluster.sync.lastSuccessful.timestamp: 0
cluster.sync.running     : 0
cluster.sync.success.count: 1

This can provide a good overview if a node has been dropping and rejoining the cluster, which can be a sign of trouble and may merit further investigation.

Resource Usage

One common problem with nodes in the cluster can be related to resource usage. For example, because all nodes in the cluster have a copy of the data, any node can service a query. While we work hard to prevent queries from causing issues, it’s possible for a node to run a bad query which utilizes too many resources. This causes the node to cease functioning properly, e.g., by losing its connection to ZooKeeper and dropping from the cluster. In this case, the node may require manual intervention and have to be cleaned up, recovered, or replaced and then rejoin the cluster. If a Stardog node is experiencing memory pressure (from Stardog or other processes on the node), you will typically see memory warnings in its log once the free memory on the node drops below 10%:

INFO  2021-04-25 00:53:06,233 [memory-monitor] com.complexible.stardog.api.NativeMemoryMonitor:reportStatus(127): Memory usage 90% - initiating detailed analysis - will check again in 1m
INFO  2021-04-25 00:53:06,236 [memory-monitor] com.complexible.stardog.api.NativeMemoryMonitor:reportStatus(128): Stardog JVM memory usage

Stardog does not log similar warnings for high CPU use or a full disk, so it’s important to monitor those resources as well and inspect them if a node is having trouble.

Timeouts

Another common issue to look out for is timeouts, especially when performing large data loads or executing long running queries. You may need to adjust various timeout settings or change certain aspects of the operation you’re attempting in order to complete it before it times out. Support can help you determine which setting needs changing depending on the nature of the timeout.

Because Stardog HTTP clients are blocking, there are cases where long running operations (such as a large bulk load or copying a large virtual graph to a named graph using SPARQL COPY) may time out on the client but continue running on the server when connecting through a load balancer. This typically happens because the load balancer idle timeout is set too low and the operation runs for longer than the load balancer allows. When this occurs, the load balancer severs the connection to the client. You can sometimes work around this by increasing the idle timeout on the load balancer. However, if the idle timeout is already at the max, the only course of action is to monitor progress in $STARDOG_HOME/stardog.log until the operation completes on the cluster. At that point, the operation should be successful, and the data should be available even though the client receives a timeout. Depending on the load balancer, this may manifest in different ways, but often you’ll receive a GATEWAY_TIMEOUT error message from the client in the middle of the operation.

Stardog Cluster in Kubernetes

Managing Stardog Cluster in Kubernetes (k8s) comes with benefits, as well as pitfalls that can be surprising to admins who are less familiar with k8s.

We recommend you deploy and manage Stardog in k8s with our open source helm charts.

Stardog Cluster uses persistent volumes in k8s to maintain state, both for Stardog itself and for ZooKeeper.

Deleting persistent volume claims for either Stardog or ZooKeeper may result in data loss. Only do so with extreme caution and after verifying that you have complete backups located outside of the k8s cluster.

Readiness and Liveness Probes

K8s uses readiness probes to determine which nodes it should route traffic to and liveness probes to determine if a node has failed and should be restarted.

Stardog provides an HTTP readiness endpoint at /admin/healthcheck, which will return 200 if a node is currently a cluster member that has fully joined. The HTTP liveness endpoint is /admin/alive and denotes whether or not the node is alive, even if it hasn’t yet joined the cluster. A node may return true for an alive probe when it is syncing data to join the cluster and false for the readiness probe because it hasn’t yet sync’d all of the data and completed the join.

If Stardog is unable to respond to the liveness probe, k8s may restart the node after a few failures. In general this should be fine, as a node may have died for any number of reasons that a restart can help fix. However, if the liveness settings are set too low and require too few failures before restarting, k8s may restart cluster nodes when it is not needed and cause issues with the cluster. If the timeouts are too low and a node is simply heavily loaded due to an expensive data load or query, it’s possible that Stardog may not be able to respond to the liveness check in time, and k8s will restart the node. Once this happens, it can cause additional issues in the cluster since the node may introduce more pressure on the other nodes as it attempts to sync and join.

It’s important to make sure your liveness probe settings are high enough for your workload to prevent unnecessary restarts by k8s. These can be adjusted in the helm chart values.yaml file as follows:

livenessProbe:
  initialDelaySeconds: 30
  periodSeconds: 30
  timeoutSeconds: 15

Clean Shutdown

It’s important to always shut down Stardog servers cleanly in order to avoid data corruption. In VMs or bare metal environments, this typically isn’t an issue because the shutdown command can be issued and wait as long as needed for Stardog to wrap up and shut down.

By default, k8s will wait 30 seconds for a graceful shutdown before forcefully terminating the pod. In some circumstances, this can cause data corruption if k8s issues a SIGKILL before Stardog has completed its shutdown. In general, it is best practice to restart Stardog in k8s by issuing shutdown to the Stardog server directly with server stop or to all cluster nodes with cluster stop instead of deleting the Stardog pods with kubectl or helm commands. Once Stardog stops, k8s will recognize this and start it again.

In cases where you need to scale the cluster down and completely stop Stardog pods, make sure no queries are running, stop all write traffic to the cluster (if possible), and increase the graceful shutdown period in k8s to a sufficiently high value to prevent data corruption.

Helpful k8s commands for debugging

When experiencing issues with a cluster in k8s, it’s important to gather all of the Stardog-specific information outlined in the getting support section, as well as information from k8s.

Get an overview of the Stardog and ZooKeeper pods, which will identify the pods that have restarted, how many times they have done so, and if they are currently passing the readiness check:

kubectl -n <namespace> get pods

Describe each of the Stardog and ZooKeeper pods, especially any that are failing the readiness check or have restarted:

kubectl -n <namespace> describe pod <pod name>

Gather logs for all pods:

kubectl -n <namespace> logs <pod name>

Gather events for all resources in the namespace:

kubectl -n <namesapce> get events --sort-by=.metadata.creationTimestamp

To help you gather information or inspect inside one of the pods, you can exec into it:

kubectl exec -n <namespace> -it <pod name> -- /bin/bash

More detailed information about helpful k8s commands for debugging can be found in the k8s docs.

Known Issues

Below is a list of known issues that can be problematic in the cluster.

Enabling abort on conflict (AoC) write strategy may cause nodes to be expelled if the workload experiences write conflicts. AoC is a non-default write conflict strategy that can be enabled for the storage layer. In this case, transactions are aborted if write conflicts are detected. When this occurs, nodes may be expelled causing cluster stability issues. At this time we do not recommend using AoC with the cluster, especially if write conflicts are possible with your workload. Instead, we recommend using the default write conflict strategy: last commit wins. Using AoC is primarily of interest if you use edge properties since AoC is required for that feature. We expect the cluster stability issues when using this feature to be fixed in a future release.
It is possible to get the cluster into an inconsistent state with nodes that are misconfigured for data sources and virtual graphs. For example, if 1 node does not have the correct driver for a data source, such as MySQL, then any MySQL data source or virtual graph would be offline on that node while the other nodes are online and able to access MySQL. The online/offline state would be apparent with the data-source list and virtul list commands when run against the individual nodes in the cluster. When a cluster is in this state, any queries that attempt to use the data source on working nodes would produce correct results whereas those same queries run against misconfigured nodes would provide empty results.

Overview
Stardog Cluster Inspection
Stardog Cluster in Kubernetes
Known Issues