Databricks Configuration
This page discusses how to configure Databricks as an external compute platform.
A Databricks Data Source added in Stardog using the data-source add CLI command or Stardog Studio can be registered as an external compute platform. To do so, add the properties described below to the data source definition.
Mandatory properties:
Property | Description | Example |
---|---|---|
external.compute | Boolean value specifying whether or not the data source is registered as an external compute platform. | true |
external.compute.host.name | Name of the Databricks workspace. | adb-XXXXXXXXXXXXXX.XX.azuredatabricks.net |
databricks.cluster.id | Databricks compute cluster ID. | 0704-XXXXXX-XXXXXdir |
stardog.host.url | Stardog URL to which Databricks should connect back to write the results. The URL should point to the same Stardog server from which the external compute operation is triggered. | https://myhost.stardog.cloud:5820 |
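
For illustration, below is a minimal sketch of a data source properties file with the mandatory external compute properties filled in. The property names and example values come from the table above; the workspace host, cluster ID, and Stardog URL are placeholders, and the standard Databricks connection properties the data source otherwise needs (JDBC URL, credentials, and so on) are omitted.

```properties
# Hypothetical data source definition (e.g. databricks-compute.properties),
# registered with the data-source add CLI command or Stardog Studio.
# Standard Databricks connection properties are omitted; only the
# external compute properties from the table above are shown.

# Register this data source as an external compute platform
external.compute=true

# Databricks workspace host name (placeholder value)
external.compute.host.name=adb-XXXXXXXXXXXXXX.XX.azuredatabricks.net

# Databricks compute cluster ID (placeholder value)
databricks.cluster.id=0704-XXXXXX-XXXXXdir

# Stardog server that Databricks connects back to with the results;
# must be the same server that triggers the external compute operation
stardog.host.url=https://myhost.stardog.cloud:5820
```
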
Optional properties:
Property | Description | Default |
---|---|---|
stardog.external.jar.path | Path of the released stardog-spark-connector jar file from which the file should be transferred to the Databricks cluster. By default it points to Stardog's public S3 bucket, where the latest released version is available. There are two options for overriding the default path: 1) The jar can be downloaded from another S3 bucket; in this case, the property should point to the custom S3 bucket path. 2) The jar can reside locally on the file system where the Stardog server is running; in this case, the property should point to the local file system path. The Stardog server will upload the jar to the Databricks cluster if it is not already present at the path specified by the stardog.external.jar.upload.path property. Download the latest jar from this link. | s3://stardog-spark/stardog-spark-connector-3.0.0.jar |
stardog.external.jar.upload.path | Path of the stardog-spark-connector jar file on the Databricks DBFS file system. Should be set both when the jar is uploaded manually by the user and when the jar is uploaded automatically by Stardog. | /FileStore/stardog/ |
stardog.external.mapping.upload.path | Path on DBFS where Stardog and the Spark job write temporary files. For example, in the case of a Virtual Graph Materialization operation, the mapping of the virtual graph is stored here. Stardog and the Spark job delete these temporary files after the process completes. | /FileStore/stardog/ |
stardog.external.databricks.job.timeout | Spark job timeout, in seconds. | 86400 |
stardog.external.databricks.task.timeout | Spark task timeout, in seconds. | 86400 |
stardog.external.databricks.task.retry.count | The number of retries to attempt before the Spark job fails. Set to zero for a single attempt with no retries. | 3 |
stardog.external.databricks.task.retry.interval.millis | Time interval, in milliseconds, after which the Spark job makes a retry attempt in case of an error. | 2000 |
stardog.external.databricks.is.retry.timeout | Boolean value specifying whether the Spark job makes a retry attempt in case of a timeout error. | false |
stardog.external.databricks.job.on.start.email.list | Comma-separated list of emails to be notified when the Spark job starts. | |
stardog.external.databricks.job.on.success.email.list | Comma-separated list of emails to be notified when the Spark job completes. | |
stardog.external.databricks.job.on.failure.email.list | Comma-separated list of emails to be notified when the Spark job errors out. | |
spark.dataset.repartition | Refer to the Spark documentation. Set this value to override the default partition behavior. | |
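
As a further sketch, optional properties can be added to the same data source definition to override the defaults. The property names come from the table above; the values below are illustrative placeholders only, e.g. a connector jar hosted on the Stardog server's local file system and adjusted retry settings.

```properties
# Hypothetical overrides for optional properties (illustrative values only)

# Use a copy of the stardog-spark-connector jar from the local file system
# of the Stardog server instead of the public S3 bucket
stardog.external.jar.path=/opt/stardog/jars/stardog-spark-connector-3.0.0.jar

# DBFS location where the jar is uploaded to (or expected at)
stardog.external.jar.upload.path=/FileStore/stardog/

# Timeout and retry behavior for the Spark job
stardog.external.databricks.job.timeout=43200
stardog.external.databricks.task.retry.count=5
stardog.external.databricks.task.retry.interval.millis=5000
stardog.external.databricks.is.retry.timeout=true

# Email notifications (placeholder addresses)
stardog.external.databricks.job.on.failure.email.list=data-team@example.com,ops@example.com
```
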