Spark on Kubernetes Configuration


Page Contents

How it works
Deployment models
Configuration
SparkApplication template
Required cluster prerequisites

This page describes how to configure Spark on Kubernetes, run through the Kubeflow Spark Operator, as an external compute platform. For an introduction to external compute, see the external compute overview.

You point Stardog at a Kubernetes cluster with a compute properties file. This file is passed to the CLI as the -c or --compute option described in the virtual graph materialization and entity resolution sections of external compute. The file holds everything Stardog needs to reach your cluster: the cluster connection details, how to authenticate, and where to find the job template.

The sections below describe how the integration works, the two deployment models, and every property the compute file accepts.

Spark on Kubernetes is supported as of Stardog 12.1, and only Spark 3.5.0 is supported. See the Compatibility Table for the full version matrix.

How it works

Stardog does not run Spark itself. It relies on the Kubeflow Spark Operator, which must already be installed in the target cluster. A single external compute operation proceeds as follows.

1. Stardog builds a SparkApplication resource. It reads the SparkApplication template you supply through the template-path property, adds the values specific to the job (the Stardog callback URL, the credentials, the mapping or query, and the target named graph) as Spark arguments, sets the main class, and gives the resource a unique name.
2. Stardog creates the SparkApplication resource in the cluster. It connects to the cluster’s Kubernetes API and creates one SparkApplication per virtual graph or job. Stardog does not contact the Spark Operator directly. It creates the resource, and the operator reacts to it.
3. The Spark Operator runs the job. The operator watches for SparkApplication resources in the namespaces it manages. When the new resource appears, it launches a Spark driver pod, which in turn requests its executor pods, standing up a Spark cluster for that single job.
4. The job reads the source data and writes the results back. The driver and its executors read the source data (a JDBC database, a CSV file, or another Spark data source), transform it to RDF using the Stardog Spark connector, and write the results back to Stardog over the URL given in stardog.host.url.
5. Stardog polls for completion. When the job finishes, the operator removes the pods, subject to the template’s timeToLiveSeconds.

The shape of the Spark job is controlled by your template, not by Stardog. The number of drivers and executors, their sizing, the container image, how the Stardog Spark connector jar and JDBC drivers are made available to the pods, the volumes, the RBAC, and the namespace are all defined in the SparkApplication template you provide. Stardog reads that template and adds only the per-job pieces (a unique metadata.name, the spec.mainClass, and the spec.arguments) before creating the resource. This split lets you align jobs with your own cluster conventions, such as private registries, secrets, and network policies, without Stardog needing a property for every Spark setting. See SparkApplication template for what the template controls.
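
The per-job population step can be pictured with a short Python sketch. It is illustrative only and is not Stardog's implementation: the function name `build_spark_application`, its parameters, the epoch-second timestamp, and the assumption that the template has already been parsed into a dictionary are all hypothetical. Only the three fields it touches (metadata.name, spec.mainClass, and spec.arguments) come from the description above.

```python
import copy
import time


def build_spark_application(template, operation, virtual_graph,
                            main_class, arguments, now=None):
    """Return a per-job copy of an already-parsed SparkApplication template.

    Hypothetical helper: names and the timestamp format are assumptions.
    """
    # Work on a deep copy so the shared template is never mutated.
    resource = copy.deepcopy(template)

    # Assumed timestamp format: whole seconds since the epoch.
    timestamp = int(time.time() if now is None else now)

    # Per-job piece 1: a unique name in the documented
    # Stardog<OperationName><VirtualGraphName><Timestamp> form.
    metadata = resource.setdefault("metadata", {})
    metadata["name"] = f"Stardog{operation}{virtual_graph}{timestamp}"

    # Per-job pieces 2 and 3: the main class and the Spark arguments.
    spec = resource.setdefault("spec", {})
    spec["mainClass"] = main_class
    spec["arguments"] = list(arguments)

    # Image, sizing, volumes, namespace, and the rest stay as written
    # in the template.
    return resource
```

Everything the sketch does not touch is carried over unchanged, which is the point of the split: sizing, images, and volumes remain under the control of the template author.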

Deployment models

There are two ways to deploy, distinguished by where Stardog runs relative to the cluster. This choice determines how Stardog authenticates to the cluster’s Kubernetes API and is controlled by the stardog.external.spark.k8s.in-cluster property.

Stardog runs inside the cluster (in-cluster=true, the default). Stardog is itself a pod in the same Kubernetes cluster, so it authenticates using its own mounted ServiceAccount token. No API server URL or bearer token is required.

Stardog runs outside the cluster (in-cluster=false). Stardog runs on a bare-metal host, a VM, or a cloud instance and reaches the cluster over the network. You must provide an explicit api-server-url and bearer-token, and, depending on how the cluster’s TLS is configured, a ca-cert-path or skip-tls-verify.

The properties for each model are described in Authentication mode below.

Configuration

The integration is configured through the compute properties file passed with -c or --compute. The rest of this section documents the mandatory properties, the authentication properties that depend on your deployment model, and the optional properties, followed by a complete example.

Mandatory properties

Property	Description	Example
stardog.external.compute.platform	Set this property to the name of the external compute platform.	`spark`
stardog.external.spark.k8s.template-path	Absolute path on the Stardog server’s filesystem to the `SparkApplication` YAML template that Stardog will populate and submit.	`/etc/stardog/spark/sparkapp-template.yaml`
stardog.host.url	Stardog URL the Spark cluster will connect back to in order to write results. Must be reachable from inside the Kubernetes cluster (cluster DNS, ingress, or NLB).	`http://stardog.stardog.svc.cluster.local:5820`

Authentication mode

As described under Deployment models, authentication to the cluster’s Kubernetes API is controlled by stardog.external.spark.k8s.in-cluster. The default is true, meaning Stardog runs as a pod in the same cluster and reads its own ServiceAccount token from the standard projected path. Set it to false and provide explicit credentials when Stardog runs outside the cluster, for example on a bare-metal host or cloud instance.

Property	Description	Default
stardog.external.spark.k8s.in-cluster	When `true`, Stardog authenticates to the Kubernetes API using its own pod’s ServiceAccount token. When `false`, the next two properties are required.	`true`
stardog.external.spark.k8s.api-server-url	URL of the Kubernetes API server. Required when `in-cluster=false`.
stardog.external.spark.k8s.bearer-token	Bearer token authorized to create `SparkApplication` resources in the target namespace. Required when `in-cluster=false`.
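
As a pre-flight aid, the rules in the two tables above can be checked before a compute file is handed to the CLI. The following Python sketch is an assumption-laden illustration, not a Stardog tool: `parse_properties` and `missing_properties` are hypothetical helper names, and the parser handles only simple `key=value` lines and `#` comments rather than the full Java properties syntax.

```python
PREFIX = "stardog.external.spark.k8s."

# Mandatory in every deployment model, per the table above.
MANDATORY = (
    "stardog.external.compute.platform",
    PREFIX + "template-path",
    "stardog.host.url",
)


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values may contain '='.
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def missing_properties(props):
    """Return the required property names that are absent or empty."""
    missing = [key for key in MANDATORY if not props.get(key)]

    # in-cluster defaults to true; explicit credentials are required
    # only when it is set to false.
    in_cluster = props.get(PREFIX + "in-cluster", "true").lower() == "true"
    if not in_cluster:
        for name in ("api-server-url", "bearer-token"):
            if not props.get(PREFIX + name):
                missing.append(PREFIX + name)
    return missing
```

An empty result means the file satisfies the documented requirements for the chosen deployment model. It does not verify that the URL is reachable or that the token is valid.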

Optional properties

Property	Description	Default
stardog.external.spark.k8s.ca-cert-path	Absolute path on the Stardog server’s filesystem to the CA certificate used to verify the API server’s TLS certificate. Use this when the cluster API server is fronted by a private CA.
stardog.external.spark.k8s.skip-tls-verify	Set to `true` to skip TLS verification of the API server certificate.	`false`
stardog.external.spark.k8s.reconcile-timeout-seconds	Fail the operation if the submitted `SparkApplication` resource has not received a status from any controller within this many seconds. This is not a job-completion timeout, since the job itself may run for hours. It catches the case where the operator is misconfigured or is not watching the target namespace, and the resource sits at an empty status indefinitely.	`120`
stardog.external.spark.k8s.unknown-grace-seconds	Once a `SparkApplication` starts returning `UNKNOWN` from the API (typically a `404` because the Spark Operator’s `timeToLiveSeconds` reaped the resource or someone deleted it out-of-band), treat it as transient for this many seconds before promoting it to an error. This prevents a TTL race from being misread as a successful materialization, while still allowing brief operator hiccups to recover without failing the run.	`120`
spark.dataset.repartition	Set this value to override the default partitioning behavior. See spark-docs for details.
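
The difference between the two timeouts is easier to see in code. The Python sketch below models them over a list of poll results. It is a simplified illustration of the documented semantics, not Stardog's polling loop: the function name `evaluate_polls`, the `(elapsed_seconds, state)` input shape, and the reduced set of states (an empty string for "no status yet", plus `RUNNING`, `UNKNOWN`, `COMPLETED`, and `FAILED`) are assumptions.

```python
def evaluate_polls(observations, reconcile_timeout=120, unknown_grace=120):
    """Classify a series of (elapsed_seconds, state) polls.

    Returns "succeeded", "failed", or "pending". Hypothetical model only.
    """
    unknown_since = None
    for elapsed, state in observations:
        if state == "":
            # No controller has written a status yet. This is what
            # reconcile-timeout-seconds guards against.
            if elapsed >= reconcile_timeout:
                return "failed"
            continue

        if state == "UNKNOWN":
            # Start (or continue) the grace window; fail once it expires.
            if unknown_since is None:
                unknown_since = elapsed
            if elapsed - unknown_since >= unknown_grace:
                return "failed"
            continue

        # Any concrete state ends the UNKNOWN streak.
        unknown_since = None
        if state == "COMPLETED":
            return "succeeded"
        if state == "FAILED":
            return "failed"

    # The job is still running; there is no job-completion timeout.
    return "pending"
```

In this model a job that runs for hours never trips the reconcile timeout, because that timer only matters while the status is empty. A brief `UNKNOWN` followed by `RUNNING` recovers, while a sustained `UNKNOWN` is reported as an error and never as success.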

Properties file example

# Selects the spark-on-k8s provider
stardog.external.compute.platform=spark

# Where Stardog finds the SparkApplication template
stardog.external.spark.k8s.template-path=/etc/stardog/spark/sparkapp-template.yaml

# Callback URL the Spark cluster uses to write back to Stardog. From inside the
# cluster, this is usually a Service DNS name.
stardog.host.url=http://stardog.stardog.svc.cluster.local:5820

# In-cluster auth (default). For out-of-cluster Stardog, set the explicit
# api-server-url + bearer-token below instead.
stardog.external.spark.k8s.in-cluster=true

# Out-of-cluster example (only when in-cluster=false):
# stardog.external.spark.k8s.in-cluster=false
# stardog.external.spark.k8s.api-server-url=https://api.k8s.example.com:6443
# stardog.external.spark.k8s.bearer-token=eyJhbGciOi...
# stardog.external.spark.k8s.ca-cert-path=/etc/stardog/spark/k8s-ca.crt

SparkApplication template

The template-path property points to a Spark Operator SparkApplication YAML document. Stardog reads the template, populates the job parameters (the Stardog server URL, the credentials, the mapping path, the query, and the target named graph) as Spark arguments, and creates the resource in the cluster.

The template controls:

- The Spark image: spec.image, spec.sparkVersion, spec.type, and spec.mode.
- mainApplicationFile: the path the driver uses to load the Stardog Spark connector jar. Stardog does not upload the connector for the Kubernetes provider, so the template must point to a jar that is already reachable by the pods, for example through a PVC mount, an init container that downloads it, a baked-in image, or an S3 location.
- Driver and executor sizing: cores, memory, and instances.
- The namespace: metadata.namespace.
- The driver’s ServiceAccount: spec.driver.serviceAccount. In Spark’s cluster mode the driver pod creates the executor pods, so this account needs the RBAC the Spark Operator requires to manage executors in the namespace.
- Volumes and JDBC driver placement: spec.volumes, spec.driver.volumeMounts, spec.executor.volumeMounts, and spec.deps.jars.
- Other operator settings: sparkConf, restartPolicy, timeToLiveSeconds, and any other setting the operator exposes.

Minimal example (PVC-based connector jar delivery):

In this example, both the Stardog Spark connector jar and the JDBC driver jar for the source database are staged on the same persistent volume, which is mounted read-only at /mounted on the driver and executor pods. The connector jar is referenced through mainApplicationFile, and the JDBC driver is added to the job’s classpath through spec.deps.jars. A JDBC driver is required whenever the source is a JDBC database; the example uses the Microsoft SQL Server driver, so substitute the driver that matches your source.

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: stardog-spark-template-placeholder
  namespace: stardog-spark
spec:
  type: Java
  mode: cluster
  image: apache/spark:3.5.0
  sparkVersion: 3.5.0
  mainApplicationFile: local:///mounted/stardog-spark-connector.jar
  restartPolicy:
    type: Never
  timeToLiveSeconds: 300
  volumes:
    - name: stardog-jar
      persistentVolumeClaim:
        claimName: stardog-spark-connector
  driver:
    cores: 1
    memory: 2g
    serviceAccount: stardog-spark-runner
    volumeMounts:
      - name: stardog-jar
        mountPath: /mounted
        readOnly: true
  executor:
    instances: 1
    cores: 1
    memory: 4g
    volumeMounts:
      - name: stardog-jar
        mountPath: /mounted
        readOnly: true
  deps:
    # JDBC driver for the source database, staged on the same volume.
    # Use the driver that matches your source.
    jars:
      - local:///mounted/mssql-jdbc.jar

The metadata.name value stardog-spark-template-placeholder is replaced at submission time with a unique per-job name of the form Stardog<OperationName><VirtualGraphName><Timestamp>, the same naming convention used by the Databricks and EMR Serverless providers.
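
For the PVC-based layout in the example, a common mistake is to reference a `local://` jar that is not under any mounted path on the driver or the executors. The Python sketch below checks for that. It is a hypothetical lint, not part of Stardog or the Spark Operator: the function name `unmounted_local_jars` is invented, the template is assumed to be already parsed into a dictionary, and the check applies only to PVC-style delivery (a jar baked into the image would be reported even though it is valid).

```python
def unmounted_local_jars(app):
    """List (role, jar) pairs whose local:// path is under no volume mount.

    Hypothetical check for PVC-based jar delivery only.
    """
    spec = app.get("spec", {})

    # The connector jar plus any extra jars such as the JDBC driver.
    jars = [spec.get("mainApplicationFile", "")]
    jars += spec.get("deps", {}).get("jars", [])

    problems = []
    for role in ("driver", "executor"):
        mounts = spec.get(role, {}).get("volumeMounts", [])
        # Normalize to a trailing slash so /mounted does not match /mounted2.
        prefixes = [m["mountPath"].rstrip("/") + "/" for m in mounts]
        for jar in jars:
            if not jar.startswith("local://"):
                continue  # remote schemes such as s3a:// are out of scope
            path = jar[len("local://"):]
            if not any(path.startswith(prefix) for prefix in prefixes):
                problems.append((role, jar))
    return problems
```

Run against the example template above, the function returns an empty list, because both jars sit under /mounted and that path is mounted on the driver and the executor.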

Required cluster prerequisites

Before the first external compute run on a Kubernetes cluster:

- Install the Spark Operator in the cluster, typically through its Helm chart, and confirm the operator is watching the namespace referenced in metadata.namespace of the template.
- Provision the connector jar wherever the template’s mainApplicationFile points, for example a PVC with an uploader, a ConfigMap-mounted init container, a custom Spark image, or an S3 URL. Stardog does not stage the connector for the Kubernetes provider.
- Configure RBAC:
  - The identity Stardog uses to authenticate (its own ServiceAccount when in-cluster=true, or the bearer-token user when in-cluster=false) must have create, get, list, watch, and delete on sparkapplications.sparkoperator.k8s.io in the target namespace.
  - The driver’s ServiceAccount, set in the template, must have the RBAC the Spark Operator requires to manage executor pods in the same namespace. The driver needs these permissions because, in Spark’s cluster mode, it is the driver that creates the executor pods; the executors themselves do not create other pods.
- Configure networking so that the driver and its executors can reach stardog.host.url. If Stardog runs in the cluster, a ClusterIP Service is the usual approach. If Stardog runs outside the cluster, expose it through an Ingress, a LoadBalancer, or a stable DNS name reachable from the pod network.
- Provision the JDBC driver jar for the source database when running virtual graph materialization against a JDBC source. The Spark job reads the source data directly, so the driver that matches your source (for example, PostgreSQL, MySQL, Oracle, or SQL Server) must be reachable by the pods and added to the job’s classpath through spec.deps.jars. Stage it the same way as the connector jar, for example on the same PVC.
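
The RBAC requirement for the identity Stardog uses can be checked against the rules of a Role before the first run. The Python sketch below is illustrative and hypothetical: `missing_verbs` is an invented helper, and the rules are assumed to have been read already (for example from a Role manifest) into the list-of-dictionaries shape that Kubernetes uses for `rules`. It does not cover the driver's ServiceAccount, whose required permissions are defined by the Spark Operator.

```python
# Verbs Stardog's identity needs on SparkApplication resources,
# as listed in the prerequisites above.
REQUIRED_VERBS = {"create", "get", "list", "watch", "delete"}


def missing_verbs(rules, api_group="sparkoperator.k8s.io",
                  resource="sparkapplications"):
    """Return the required verbs, sorted, that the rules do not grant."""
    granted = set()
    for rule in rules:
        groups = rule.get("apiGroups", [])
        resources = rule.get("resources", [])
        # "*" is the Kubernetes RBAC wildcard for groups and resources.
        group_ok = api_group in groups or "*" in groups
        resource_ok = resource in resources or "*" in resources
        if group_ok and resource_ok:
            granted.update(rule.get("verbs", []))

    # A wildcard verb grants everything.
    if "*" in granted:
        return []
    return sorted(REQUIRED_VERBS - granted)
```

An empty list means the Role grants every verb named above in the target namespace. The sketch only inspects the rules it is given, so it cannot tell whether the Role is actually bound to the ServiceAccount or token user.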

Pass-through authentication for the virtual graph is not supported on Spark on Kubernetes, consistent with Databricks and EMR Serverless. The Spark job authenticates back to Stardog using the triggering user’s token. See OAuth and Authentication for IDP details.
