This section discusses our recommendations for deploying a Stardog High Availability Cluster in a production environment.
ZooKeeper is a critical component of a Stardog HA cluster, providing metadata persistence[^1] and coordination services to the nodes in the cluster. If you deploy your own Stardog HA Cluster, you must fully understand how to configure and maintain a ZooKeeper ensemble.
See the ZooKeeper documentation for more details. In particular, understanding ZooKeeper’s system requirements is critical.
As a starting point for provisioning a server to host a ZooKeeper node, the Required Software section of the Administrator’s Guide states: “At Yahoo!, ZooKeeper is usually deployed on dedicated RHEL boxes, with dual-core processors, 2GB of RAM, and 80GB IDE hard drives.”
Whether you run the ZooKeeper ensemble for your Stardog HA cluster on physical hardware or in a virtualized environment such as Amazon Web Services (AWS), we recommend the following:
- A ZooKeeper ensemble must consist of an odd number of nodes. Three nodes is the minimum recommended configuration.
- Run each ZooKeeper node process on its own server. In this context, “server” refers to an OS host running on physical hardware or on a virtual machine.
- Do NOT share the server running a ZooKeeper process with any other applications[^2], including a Stardog node process.
- Minimize dependencies between the servers hosting each ZooKeeper node process. For example, when running on VMs in AWS, deploy each VM to a different Availability Zone.
- Ensure the network between the nodes in the ZooKeeper ensemble is low-latency and reliable (e.g., a LAN in the same data center, not a WAN).
- Similarly, ensure the network between the ZooKeeper ensemble and the nodes in the Stardog cluster is low-latency and reliable.
- Place the ZooKeeper data and log directories on separate disks[^3], and ensure the disks provide sufficient QoS guarantees[^4].
- Follow the guidelines in the Things to Avoid section of the ZooKeeper configuration documentation.
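The disk and ensemble-size recommendations above map directly onto ZooKeeper's configuration file. The following `zoo.cfg` is a minimal sketch for a three-node ensemble; the hostnames (`zk1`–`zk3`) and paths are placeholders to adjust for your environment, and a production configuration will typically set additional options.

```properties
# Minimal zoo.cfg sketch for a three-node ensemble (placeholder values).
# Standard ZooKeeper timing settings, in ticks of tickTime milliseconds.
tickTime=2000
initLimit=10
syncLimit=5
clientPort=2181

# Per the "separate disks" recommendation above: keep snapshots and the
# transaction log on different devices, not just different partitions.
dataDir=/data/zookeeper
dataLogDir=/datalog/zookeeper

# One entry per ensemble member; hostnames are placeholders.
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
```

Each node also needs a `myid` file in its `dataDir` whose contents match its `server.N` entry.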
In addition to a ZooKeeper ensemble, a Stardog HA cluster consists of one or more Stardog server nodes (a minimum of three is recommended). A client typically accesses the cluster via a load balancer. See the “Stardog Cluster Architecture” section of the Tuning Cluster for Cloud blog post for a good overview of a Stardog HA cluster.
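As a sketch of how the two tiers connect, each Stardog node is pointed at the ZooKeeper ensemble in its `stardog.properties`. The hostnames below are placeholders, and you should verify the cluster property names against the documentation for your Stardog version.

```properties
# Enable Stardog cluster (Pack) mode on this node.
pack.enabled=true
# Comma-separated addresses of the ZooKeeper ensemble (placeholder hostnames).
pack.zookeeper.address=zk1:2181,zk2:2181,zk3:2181
```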
Many of the recommendations for configuring the Stardog server nodes in a cluster are similar to those for ZooKeeper. We provide a Capacity Planning doc page to assist with provisioning CPU, memory, and disk. Note, however, that those recommendations are for a stand-alone Stardog server. When provisioning a host for a Stardog cluster node, increase the provisioning requirements by 25–50% to account for the overhead of the node’s participation in cluster membership.
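For example, applying the 25–50% uplift to a stand-alone sizing estimate might look like the following. The numbers here are purely illustrative, not Stardog defaults.

```python
def cluster_node_sizing(standalone, overhead=0.25):
    """Scale stand-alone capacity estimates for cluster-membership overhead.

    `overhead` should be between 0.25 and 0.50 per the recommendation above.
    """
    return {resource: value * (1 + overhead)
            for resource, value in standalone.items()}

# Illustrative stand-alone estimate from capacity planning (not a default).
standalone = {"ram_gb": 32, "disk_gb": 500}
print(cluster_node_sizing(standalone, overhead=0.5))
# At the 50% end of the range: ram_gb -> 48.0, disk_gb -> 750.0
```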
For guidance on provisioning a VM to host a Stardog server node in the cloud, see the “AWS specifics” section of the previously mentioned blog post.
Whether you run the Stardog server nodes for your cluster on physical hardware or in a virtualized environment, we recommend the following:
- Three Stardog nodes is the minimum recommended configuration for a production Stardog HA cluster[^5].
- Run each Stardog node process on its own server. In this context, “server” refers to an OS host running on physical hardware or on a virtual machine.
- Do NOT share the server running a Stardog process with any other applications, including a ZooKeeper node process.
- Minimize dependencies between the servers hosting each Stardog node process. For example, when running on VMs in AWS, deploy each VM to a different Availability Zone.
- Ensure the network between the nodes in the Stardog cluster is low-latency and reliable (e.g., a LAN in the same data center, not a WAN).
- Similarly, ensure the network between the Stardog cluster and its ZooKeeper ensemble is low-latency and reliable.
- Allocate a volume for the JVM’s temporary directory that is at least as large and performant as the
Further reading:

- ZooKeeper Administrator’s Guide
- Our blog post Tuning Cluster for Cloud discusses a variety of factors that you should consider when deploying a Stardog HA Cluster as well as some adjustments you should make depending on your workload. While the title mentions “Cloud”, the recommendations are largely applicable to any Stardog HA Cluster deployment.
- Stardog Capacity Planning doc page
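Once the servers are provisioned, it is worth verifying that each Stardog host can actually reach every ensemble member. One quick check is ZooKeeper's four-letter `ruok` command, which a healthy node answers with `imok` (in recent ZooKeeper versions the command must be enabled via `4lw.commands.whitelist`). The hostnames below are placeholders; this is a connectivity sanity check, not a substitute for real monitoring.

```python
import socket

def zk_ruok(host: str, port: int = 2181, timeout: float = 2.0) -> bool:
    """Return True if the ZooKeeper node answers 'imok' to 'ruok'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            return sock.recv(4) == b"imok"
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False

# Placeholder hostnames; substitute your ensemble members.
for host in ("zk1", "zk2", "zk3"):
    print(host, "ok" if zk_ruok(host) else "NOT responding")
```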
[^1]: ZooKeeper is used to persist the latest Stardog transaction ID and to store metadata about the Stardog cluster. However, ZooKeeper does not store any Stardog database data, such as the contents of a transaction.
[^2]: The Required Software section of the Administrator’s Guide states: “Three ZooKeeper servers is the minimum recommended size for an ensemble, and we also recommend that they run on separate machines.”
[^3]: The Single Machine Requirements section of the ZooKeeper Administrator’s Guide states: “ZooKeeper’s transaction log must be on a dedicated device. (A dedicated partition is not enough.)”
[^4]: Some cloud and/or container environments offer varying QoS guarantees for their disks, which can impact ZooKeeper’s performance as well as overall cluster reliability. Specifically, if ZooKeeper is not able to persist information to disk, across all of its nodes, within its timeout limits, then sessions may be dropped and Stardog nodes may be expelled from the cluster. For example, on AWS, you can use general purpose volumes (gp3) for a lightly loaded ZooKeeper cluster, but for production use, we recommend using provisioned IOPS volumes (io2) so that credit-limiting does not increase disk latency. If you decide to deploy your ZooKeeper cluster with general purpose disk volumes, be sure to measure the cluster’s performance under expected workloads. See the “AWS specifics” section of the Tuning Cluster for Cloud blog post for more information.
[^5]: The number of Stardog nodes is not required to be an odd number, since we use ZooKeeper to manage the decisions about cluster membership. The number of Stardog nodes you deploy depends on your HA requirements (i.e., how many nodes can fail without killing the cluster) and how much client load must be supported.
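The fault-tolerance arithmetic differs between the two tiers: a ZooKeeper ensemble of N nodes needs a majority quorum of floor(N/2)+1 and therefore tolerates floor((N-1)/2) failures, while the Stardog node count is driven by capacity and HA needs rather than quorum. A small sketch of the ensemble side:

```python
def zk_quorum(n: int) -> int:
    """Quorum (majority) size for an n-node ZooKeeper ensemble."""
    return n // 2 + 1

def zk_failures_tolerated(n: int) -> int:
    """Node failures the ensemble survives while keeping a quorum."""
    return (n - 1) // 2

for n in (3, 4, 5):
    print(f"{n} nodes: quorum {zk_quorum(n)}, "
          f"tolerates {zk_failures_tolerated(n)} failure(s)")
# Note that 4 nodes tolerate no more failures than 3, which is why
# odd-sized ensembles are recommended.
```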