
Operating the Cluster

This chapter covers operating a Stardog Cluster. This page describes basic operations such as adding nodes to a cluster and taking backups. More advanced topics, including managing a distributed cache, standby nodes, read replica nodes, geo replica nodes, and troubleshooting the cluster, are covered on separate pages.

Page Contents
  1. Adding Nodes to the Cluster
  2. Shutting down the Cluster
  3. Upgrading the Cluster
    1. Upgrades
    2. Rolling Upgrades
  4. Backing up the Cluster
  5. Restoring the Cluster
  6. Repairing a Database
  7. Read-Only Mode
  8. ZooKeeper

Adding Nodes to the Cluster

Stardog Cluster stores the UUID of the last committed transaction for each database in ZooKeeper. When a new node joins the cluster, it compares the local transaction ID of each database with the corresponding transaction ID stored in ZooKeeper. If there is a mismatch, the node synchronizes the database contents from another node in the cluster. If there are no other nodes in the cluster to synchronize from, the new node cannot join and will shut itself down. For this reason, if you are starting a new cluster, make sure the ZooKeeper state is cleared. If you are keeping an existing cluster's state, start new nodes only when at least one node is already running in the cluster.

If there are active transactions in the cluster, the joining node waits for those transactions to finish and then synchronizes its databases. More transactions may take place during synchronization; in that case the joining node continues synchronizing and retrieves the data from those new transactions, so joining takes longer under a continuous write load. Note that the new node will not be available for requests until all of its databases are synchronized. The proxy/load balancer should perform a health check before forwarding requests to a new node (as shown in our example HAProxy configuration), so user requests are always forwarded to available nodes.

Please see the open transactions section of the troubleshooting guide for further guidance when investigating whether open transactions are preventing a node from joining.
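
For example, to watch a node join, you can poll cluster membership from an existing node and probe the joining node's health check endpoint directly. This is a minimal sketch: the hostnames stardog1 and stardog3 are illustrative, and it assumes the /admin/healthcheck endpoint referenced by the example HAProxy configuration.

$ stardog-admin --server http://stardog1:5820 cluster info
$ curl -i http://stardog3:5820/admin/healthcheck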

Shutting down the Cluster

It is important to shut down the cluster in a safe and controlled manner in order to prevent possible inconsistencies or corruption.

Before shutting down the cluster, ensure there are no transactions running; stardog-admin db status <db name> will show whether a database has any open transactions. This step is not strictly required, but it reduces the risk of open transactions being killed, allows the cluster to stop quickly, and helps prevent non-coordinator nodes from having to re-sync when they rejoin the cluster the next time it is started.
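
For example, assuming databases named db1 and db2 and a node at stardog1:5820 (all illustrative), you could check each database before stopping the cluster:

$ for db in db1 db2; do stardog-admin --server http://stardog1:5820 db status "$db"; done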

In order to shut down the cluster, you only need to execute the following command once:

$ stardog-admin cluster stop

The cluster stop request will cause all available nodes in the cluster to shut down. If a node was expelled from the cluster due to a failure, it would not receive this command and might need to be shut down manually. In order to shut down a single node in the cluster, use the regular server stop command and be sure to specify the server address:

$ stardog-admin --server http://localhost:5821 server stop

If you send the server stop command to the load balancer, a random node selected by the load balancer will shut down.

Monitor the logs and processes of all nodes to ensure the Stardog server process has stopped before restarting or terminating the underlying system.

If you are running Stardog in a container, stop the cluster via cluster stop or configure a sufficiently high shutdown timeout for your container runtime to avoid possible corruption. By default, container runtimes such as Docker or Kubernetes first send a SIGTERM, wait a short amount of time (e.g. 10s-30s), and then send a SIGKILL. In some instances, this can result in corruption if Stardog has not shut down cleanly before it is forcibly killed by the container runtime.
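
For example, with plain Docker you can raise the grace period when stopping a node's container; the container name stardog and the 300-second timeout below are illustrative. Kubernetes users would instead raise terminationGracePeriodSeconds on the pod.

$ docker stop --time 300 stardog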

Upgrading the Cluster

The process to upgrade Stardog Cluster is straightforward; however, there are a few extra steps you should take to ensure the upgrade goes as quickly and smoothly as possible. Before you begin, place the new Stardog binaries on all of the cluster nodes. Always follow system administration best practices: take full backups of all Stardog nodes and ZooKeeper, and verify those backups.

Upgrades

After backing up all nodes, note which node is the coordinator since this is the first node that will be started as part of the upgrade. stardog-admin cluster info will show the nodes in the cluster and which one is the coordinator.

Next, follow the process in the shutdown section and stop the cluster before continuing.

Once all nodes have stopped, back up the STARDOG_HOME directories on all of the nodes.

With the new version of Stardog, bring the cluster up one node at a time, starting with the previous coordinator. As each node starts, make sure it is able to join the cluster cleanly before moving on to the next node.
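
Put together, the upgrade of a three-node cluster might look like the sketch below, where stardog1 is the previous coordinator and the version number, paths, and hostnames are illustrative:

# on every node: stage the new release alongside the old one
$ unzip stardog-X.Y.Z.zip -d /opt

# after stopping the cluster and backing up STARDOG_HOME everywhere,
# start the previous coordinator first with the new binaries
$ /opt/stardog-X.Y.Z/bin/stardog-admin server start

# confirm it is up and listed as coordinator before starting the next node
$ stardog-admin --server http://stardog1:5820 cluster info

# repeat server start and the cluster info check on stardog2, then stardog3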

Rolling Upgrades

Stardog Cluster supports rolling upgrades for patch release versions. This means two different versions of Stardog can be running and part of the cluster at the same time, e.g., two nodes running 8.0.0 and one node running 8.0.1 while the cluster is being upgraded from 8.0.0 to 8.0.1.

To perform a rolling upgrade, you must first put the cluster into read-only mode.

Once the cluster is in read-only mode, you can begin upgrading the nodes one at a time. First, shut down the node you wish to upgrade, e.g.:

$ stardog-admin --server http://stardog1:5820 server stop

Be careful not to issue the cluster stop command to one of the nodes, or it will be replicated to all nodes and shut the entire cluster down.

Once the node has shut down, you can start the new version of Stardog on that node and wait for it to join the cluster. You can confirm that it has joined the cluster with the cluster info command.

When all nodes in the cluster have been upgraded, you can stop read-only mode and resume operations.
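
End to end, a rolling upgrade of one node looks roughly like the following; the hostnames are illustrative, and the middle two steps are repeated for each node in turn:

# put the whole cluster into read-only mode once
$ stardog-admin cluster readonly-start

# stop ONE node directly (not via the load balancer, and not with cluster stop)
$ stardog-admin --server http://stardog1:5820 server stop

# start the new Stardog version on that node, then confirm it rejoined
$ stardog-admin --server http://stardog2:5820 cluster info

# after every node has been upgraded, leave read-only mode
$ stardog-admin cluster readonly-stop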

If the cluster is managed with Helm in Kubernetes, Helm can be used to perform the rolling upgrade. However, the cluster still must be in read-only mode for the rolling upgrade to work.

Rolling upgrades are not currently supported for major or minor releases, e.g., Stardog 8 to 9 or Stardog 8.0 to 8.1.

Backing up the Cluster

Backing up the cluster is similar to backing up single-node Stardog. However, there are a few points to be aware of. All nodes in the cluster will perform a backup unless S3 is the backup location, in which case only a single node will perform the backup to the S3 bucket.

If you are backing up to S3, the backup.location server property should be the same on all nodes in the cluster since any node may perform the backup.

You can also disable backup replication in the cluster by setting the following option:

pack.backups.replicated.scheme=none

If backup replication is disabled, only the node that receives the command performs the backup; otherwise, all nodes in the cluster run the backup (whether it is a db backup or a server backup).

If backup.location specifies a backup directory mounted on the Stardog nodes, it can point to a different directory on each node in the cluster, if required. In this case, you should disable replicated backups and issue the backup command to each node individually.
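
For example, each node's stardog.properties might contain something like the following (the mount path is illustrative and may differ per node), after which you would issue the backup command to each node directly rather than through the load balancer:

backup.location=/mnt/backups/stardog
pack.backups.replicated.scheme=none

$ stardog-admin --server http://stardog1:5820 server backup
$ stardog-admin --server http://stardog2:5820 server backup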

Finally, if --to is passed to the backup CLI commands, it will take precedence over either backup.dir or backup.location specified in stardog.properties, and all nodes will perform a backup to the specified location.
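
For example, the following backs up a single database to an explicit location, overriding backup.dir and backup.location (the path and database name are illustrative):

$ stardog-admin db backup --to /mnt/backups/adhoc myDatabase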

Always remember to follow best practices for backups. Transfer any backups out of the infrastructure where Stardog runs (ideally to a separate data center, if possible), and validate them in a separate environment before assuming they are correct.

Restoring the Cluster

Similar to single-node Stardog, you can restore individual databases with db restore. However, db restore from a file-based backup (not an S3 or GCP backup, for instance) can only be run on a single-node cluster to ensure the cluster remains consistent. You can follow the same instructions for scaling the cluster down and up as you would for repairing a database in the cluster. Restoring from non-file-based backups will be replicated to all nodes and can safely be run in multi-node clusters.

Because Stardog Cluster uses ZooKeeper to ensure strong consistency between all of the nodes in the cluster, we recommend server restore be done on only one node with a fresh deploy of ZooKeeper (i.e., clear ZooKeeper’s state once it is no longer in use).

The operational process is as follows (a command-level sketch of steps 4-8 appears after the list):

  1. Shut down Stardog on all nodes in the cluster.
  2. Shut down the ZooKeeper ensemble, if possible. If that’s not possible, we recommend backing up ZooKeeper’s state and wiping the contents stored by Stardog.
  3. Create an empty $STARDOG_HOME directory on all of the Stardog Cluster nodes. For each cluster node, copy your Stardog license key and the stardog.properties file for that node into the empty directory.
  4. On a single node, export $STARDOG_HOME to point to its empty home directory and run server restore, the same as you would for single-node Stardog.
  5. Start a fresh ZooKeeper ensemble with an empty data directory.
  6. Start ONLY the Stardog node where you performed server restore. Verify the node starts and is in the cluster with the cluster info command before continuing to step 7.
  7. Start a second node in the cluster with its empty home directory. Wait for it to sync and join the cluster, as reported by cluster info. Wait until the node joins before moving to step 8.
  8. Repeat step 7, one node at a time, for the remaining cluster nodes.
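
A command-level sketch of steps 4-8 for a three-node cluster, where the hostnames stardog1-3, the home directory, and the backup path are all illustrative:

# step 4: on stardog1, point STARDOG_HOME at the empty home directory and restore
$ export STARDOG_HOME=/var/opt/stardog
$ stardog-admin server restore /mnt/backups/stardog/<server-backup>

# step 6: after starting the fresh ZooKeeper ensemble, start ONLY stardog1 and verify it
$ stardog-admin server start
$ stardog-admin --server http://stardog1:5820 cluster info

# steps 7-8: start stardog2, wait for it to appear in cluster info, then repeat for stardog3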

Repairing a Database

The db repair command can only be run on a single-node cluster to ensure the cluster remains consistent. Therefore, you must shut down the additional nodes before you can run repair. We recommend you terminate the non-coordinator nodes in your cluster by issuing the server stop command to each server individually. Terminating writes to the cluster before shutting down the nodes will help speed the process along. You can do this by offlining any databases with open write transactions or by putting the cluster into read-only mode.

Once you scale the cluster down to a single node, you can run the repair command. After it completes, you can start the other cluster nodes again, one at a time, allowing each to join before moving on to the next one until your cluster is back at full strength.
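
For a three-node cluster, the sequence might look like this sketch (the hostnames and database name are illustrative):

# stop the non-coordinator nodes individually (do NOT use cluster stop)
$ stardog-admin --server http://stardog2:5820 server stop
$ stardog-admin --server http://stardog3:5820 server stop

# with only the coordinator left, repair the database
$ stardog-admin --server http://stardog1:5820 db repair myDatabase

# restart stardog2, confirm it joins via cluster info, then restart stardog3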

Read-Only Mode

Stardog Cluster can run in read-only mode, introduced as a preview feature in 7.7.0 and generally available since 9.0. In read-only mode, the cluster accepts reads (such as listing roles and users) and read queries, and rejects writes, including:

  • Data adds
  • Database creations and drops
  • Restoring backups
  • SPARQL update queries
  • User and role modifications
  • VG operations to add, remove, and refresh VGs

In addition to read operations, other operations that are permitted in read-only mode include:

  • Disabling read-only mode
  • Shutting down the cluster
  • Optimizing a database
  • Killing a query
  • Taking backups
  • Generating the cluster diagnostic report

Read-only mode is a cluster-only feature. It is not available for single-node Stardog.

You can start read-only mode on a running cluster with the following command:

$ stardog-admin cluster readonly-start

If the command is unable to acquire the necessary locks due to other write transactions, you can stop them by offlining those databases and then starting read-only mode. Be advised that offlining a database will terminate all read and write queries for that database.
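
For example, assuming a database named myDatabase holds the open write transactions (the name is illustrative):

$ stardog-admin db offline myDatabase
$ stardog-admin cluster readonly-start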

Stop read-only mode, returning the cluster to read/write, with the following command:

$ stardog-admin cluster readonly-stop

A user must be an admin or have the permission [EXECUTE, dbms-admin:read-only] to start and stop read-only mode.

ZooKeeper

ZooKeeper is independent of the Stardog Cluster nodes. Stardog 8.2+ supports ZooKeeper 3.6 and 3.7 in GA. ZooKeeper 3.5 is no longer supported, as it was declared end of life on June 1, 2022. Previous versions of Stardog support different ZooKeeper versions, as noted in the release notes. Prior to Stardog 7.9.0, Stardog officially supported the latest version of ZooKeeper 3.4.

Please refer to the ZooKeeper documentation for more details on managing ZooKeeper and following its best practices.

Stardog uses Apache Curator as the client library for ZooKeeper. There is a known Curator bug that prevents it from re-resolving the IPs of the ZooKeeper instances if they change. This bug impacts versions of Stardog prior to 8.2. If you are running a version of Stardog < 8.2 and performing an operation that causes the ZooKeeper IPs to change, you must restart Stardog in order for it to properly reconnect to ZooKeeper. This can be particularly problematic in environments like Kubernetes where node upgrades, etc. may cause ZooKeeper pods to get new IPs. It is recommended that you shut down Stardog before performing any operations that will result in new IPs for ZooKeeper nodes.

To upgrade ZooKeeper, we recommend you first stop all writes to Stardog and then back up all Stardog and ZooKeeper nodes. Once the backup is complete, stop all Stardog nodes, following the instructions in the shutdown section.

After the cluster is stopped, you can upgrade ZooKeeper, either by performing a rolling upgrade of the ZooKeeper nodes (if allowed between the ZooKeeper versions you are upgrading) or by deploying a new, empty ZooKeeper ensemble with the new version.

If you deploy a new, empty ZooKeeper ensemble, make sure the first Stardog cluster node you start contains all of your data. If you start a Stardog node that is out of date (e.g., one that was expelled from the cluster before you stopped writes and shut down the cluster), you may lose data when the remaining Stardog cluster nodes start and sync.

Stardog cluster nodes can be started one at a time once the new version of ZooKeeper is running successfully on all ZooKeeper nodes.
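
Before starting the Stardog nodes, you can confirm each ZooKeeper server is healthy using ZooKeeper's four-letter-word commands; the hostname zk1 is illustrative, and on ZooKeeper 3.5+ these commands must be whitelisted via 4lw.commands.whitelist.

$ echo ruok | nc zk1 2181    # a healthy server replies "imok"
$ echo srvr | nc zk1 2181    # shows the server's mode (leader/follower) and version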