
Databricks Configuration

This page discusses how to configure Databricks as an external compute platform.

Page Contents
  1. Mandatory properties:
  2. Optional properties:

A Databricks Data Source added in Stardog using the data-source add CLI command or Stardog Studio can be registered as an external compute platform. To configure the Databricks data source as an external compute platform, add the properties described below to the data source definition (see the example after the mandatory properties table).

Mandatory properties:

| Property | Description | Example |
|----------|-------------|---------|
| `external.compute` | Boolean value specifying whether or not the data source is registered as an external compute platform. | `true` |
| `external.compute.host.name` | Name of the Databricks workspace. | `adb-XXXXXXXXXXXXXX.XX.azuredatabricks.net` |
| `databricks.cluster.id` | Databricks compute cluster id. | `0704-XXXXXX-XXXXXdir` |
| `stardog.host.url` | Stardog URL to which Databricks should connect back to write the results. The URL should point to the same Stardog server from which the external compute operation is triggered. | `https://myhost.stardog.cloud:5820` |
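
For illustration, the snippet below sketches what a complete data source definition might look like once the mandatory external compute properties are added. The JDBC connection properties and all placeholder values (workspace host, cluster id, token, file name) are illustrative assumptions and will differ in your environment.

```properties
# databricks_ds.properties -- illustrative data source definition (assumed file name)

# JDBC connection to Databricks (assumed/illustrative values -- replace with your own)
jdbc.url=jdbc:databricks://adb-XXXXXXXXXXXXXX.XX.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=<your-http-path>
jdbc.username=token
jdbc.password=<personal-access-token>
jdbc.driver=com.databricks.client.jdbc.Driver

# Mandatory external compute properties from the table above
external.compute=true
external.compute.host.name=adb-XXXXXXXXXXXXXX.XX.azuredatabricks.net
databricks.cluster.id=0704-XXXXXX-XXXXXdir
stardog.host.url=https://myhost.stardog.cloud:5820
```

The resulting file can then be supplied to the data-source add CLI command mentioned at the top of this page, or the same properties can be entered through Stardog Studio.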

Optional properties:

| Property | Description | Default |
|----------|-------------|---------|
| `stardog.external.jar.path` | Path of the released stardog-spark-connector jar file from which the file is transferred to the Databricks cluster. By default it points to Stardog's public S3 bucket, where the latest released version is available. There are two options for overriding the default path: 1) the jar can be downloaded from another S3 bucket, in which case this should point to a custom S3 bucket path; 2) the jar can reside locally on the file system where the Stardog server is running, in which case it should point to the local file system path. The Stardog server will upload the jar to the Databricks cluster if it is not present at the location specified by the `stardog.external.jar.upload.path` property. Download the latest jar from this link. | `s3://stardog-spark/stardog-spark-connector-3.0.0.jar` |
| `stardog.external.jar.upload.path` | Path of the stardog-spark-connector jar file on the Databricks DBFS file system. Should be set both when the jar is uploaded manually by the user and when the jar is uploaded automatically by Stardog. | `/FileStore/stardog/` |
| `stardog.external.mapping.upload.path` | Path on DBFS where Stardog and the Spark job will write temporary files. For example, in the case of a Virtual Graph Materialization operation, the mapping of the virtual graph will be stored here. Stardog and the Spark job will delete these temporary files after completing the process. | `/FileStore/stardog/` |
| `stardog.external.databricks.job.timeout` | Spark job timeout, in seconds. | `86400` |
| `stardog.external.databricks.task.timeout` | Spark task timeout, in seconds. | `86400` |
| `stardog.external.databricks.task.retry.count` | The number of retries to attempt before the Spark job fails. Set to zero for a single attempt with no retries. | `3` |
| `stardog.external.databricks.task.retry.interval.millis` | Time interval, in milliseconds, after which the Spark job makes a retry attempt in case of an error. | `2000` |
| `stardog.external.databricks.is.retry.timeout` | Boolean value specifying whether or not the Spark job makes a retry attempt in case of a timeout error. | `false` |
| `stardog.external.databricks.job.on.start.email.list` | Comma-separated list of emails to be notified when the Spark job starts. | |
| `stardog.external.databricks.job.on.success.email.list` | Comma-separated list of emails to be notified when the Spark job completes. | |
| `stardog.external.databricks.job.on.failure.email.list` | Comma-separated list of emails to be notified when the Spark job errors out. | |
| `spark.dataset.repartition` | Refer to the Spark docs. Set this value to override the default partitioning behavior. | |
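
As a sketch of how these optional properties combine with the defaults, the fragment below overrides the jar location to a local path on the Stardog server, adjusts the retry behavior (including retrying on timeouts), and adds a failure notification. The specific path, retry values, and email address are illustrative assumptions.

```properties
# Optional external compute overrides (illustrative values)

# Use a connector jar that already resides on the Stardog server's local file system
stardog.external.jar.path=/opt/stardog/lib/stardog-spark-connector-3.0.0.jar
# DBFS location the jar is uploaded to on the Databricks cluster
stardog.external.jar.upload.path=/FileStore/stardog/

# Retry up to 5 times, waiting 5 seconds between attempts, and retry on timeouts as well
stardog.external.databricks.task.retry.count=5
stardog.external.databricks.task.retry.interval.millis=5000
stardog.external.databricks.is.retry.timeout=true

# Notify this address if the Spark job errors out
stardog.external.databricks.job.on.failure.email.list=data-team@example.com
```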