
EMR Serverless Configuration

This page discusses how to configure EMR Serverless as an external compute platform.


To run Spark jobs on EMR Serverless, specific properties must be provided in a properties file. This file can be passed with the `-c` or `--compute` option in the CLI, as described in the virtual graph materialization and entity resolution sections of external compute.

The following properties must be present in the properties file.

Mandatory properties:

| Property | Description | Example |
| --- | --- | --- |
| `stardog.external.compute.platform` | Set this property to the name of the external compute platform. | `emr-serverless` |
| `stardog.external.aws.region` | The AWS region where EMR Studio and the application are hosted. | `us-east-1` |
| `stardog.external.aws.access.key` | AWS access key of the temporary credentials. | `ASIXXXXXXXXX4X` |
| `stardog.external.aws.secret.key` | AWS secret key of the temporary credentials. | `kgweBpwG/CS9j1yTm0AxY7KZ04wRRYrg+3pt8rek` |
| `stardog.external.aws.session.token` | AWS session token of the temporary credentials. | `FwoGZXIvYXdzEE4aDGJNU` |
| `stardog.external.emr-serverless.application.id` | Application ID of the EMR Serverless application. | `00fa02fbl2qujv90` |
| `stardog.external.emr-serverless.execution.role.arn` | The role that has access to EMR Serverless resources. This should be the same role used to generate the temporary credentials. | `arn:aws:iam::626720997297:role/emraccess` |
| `stardog.host.url` | Stardog URL to which EMR Serverless should connect back to write the results. The URL should point to the same Stardog server from which the external compute operation was triggered. | `https://myhost.stardog.cloud:5820` |
| `spark.executorEnv.JAVA_HOME` | Mandatory because the latest version of EMR Serverless does not support Java 11, which the Stardog Spark connector requires. The value should be the path where Java is installed in the custom image. | `/usr/lib/jvm/java-11-openjdk-11.0.18.0.10-1.amzn2.0.1.x86_64` |
| `spark.emr-serverless.driverEnv.JAVA_HOME` | Mandatory for the same reason as `spark.executorEnv.JAVA_HOME`, but applied to the driver environment. The value should be the path where Java is installed in the custom image. | `/usr/lib/jvm/java-11-openjdk-11.0.18.0.10-1.amzn2.0.1.x86_64` |
| `stardog.external.jar.path` | Path of the released stardog-spark-connector jar file. This value should either point to Stardog's public S3 bucket, where the latest released version is available, or to any other S3 bucket where the user has placed the jar. Download the latest jar from this link. | `s3://stardog-spark/stardog-spark-connector-3.1.0.jar` |
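Putting the mandatory properties together, a properties file might look like the following sketch. All values are the illustrative examples from the table above; substitute your own credentials, application ID, role ARN, and host URL.

```properties
# EMR Serverless external compute properties (illustrative values)
stardog.external.compute.platform=emr-serverless
stardog.external.aws.region=us-east-1
stardog.external.aws.access.key=ASIXXXXXXXXX4X
stardog.external.aws.secret.key=kgweBpwG/CS9j1yTm0AxY7KZ04wRRYrg+3pt8rek
stardog.external.aws.session.token=FwoGZXIvYXdzEE4aDGJNU
stardog.external.emr-serverless.application.id=00fa02fbl2qujv90
stardog.external.emr-serverless.execution.role.arn=arn:aws:iam::626720997297:role/emraccess
stardog.host.url=https://myhost.stardog.cloud:5820
spark.executorEnv.JAVA_HOME=/usr/lib/jvm/java-11-openjdk-11.0.18.0.10-1.amzn2.0.1.x86_64
spark.emr-serverless.driverEnv.JAVA_HOME=/usr/lib/jvm/java-11-openjdk-11.0.18.0.10-1.amzn2.0.1.x86_64
stardog.external.jar.path=s3://stardog-spark/stardog-spark-connector-3.1.0.jar
```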


If you need to provide any Spark-specific configurations, those can be set in the same properties file.

Optional properties:

| Property | Description | Default |
| --- | --- | --- |
| `spark.dataset.repartition` | Refer to the Spark docs. Set this value to override the default partitioning behavior. | |

A new role can be configured so that it can access EMR Serverless resources and the Elastic Container Registry (ECR) where the custom image is deployed. To ensure security and authentication, temporary credentials should be created using this role and provided via the properties described in the table above. These credentials include the access key, secret key, and session token, and can be generated programmatically using the AWS CLI or SDKs.
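For example, temporary credentials can be generated with the AWS CLI's `sts assume-role` command. The role ARN and session name below are placeholders; the three values returned map directly to the `access.key`, `secret.key`, and `session.token` properties above.

```shell
# Assume the execution role and print the temporary credentials
# (role ARN and session name are placeholders)
aws sts assume-role \
  --role-arn arn:aws:iam::626720997297:role/emraccess \
  --role-session-name stardog-external-compute \
  --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
  --output text
```

Note that temporary credentials expire, so they must be regenerated and the properties file updated before each run once the session has lapsed.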
Additionally, a custom image with Java 11 has to be built and configured for the EMR application, as described in the AWS documentation.
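A minimal sketch of such a custom image is shown below. The base image tag and package name are assumptions, not requirements from this page; consult the AWS documentation on EMR Serverless custom images for the exact base image and validation rules. The installed Java path should match the `JAVA_HOME` properties above.

```dockerfile
# Sketch: custom EMR Serverless Spark image with Java 11
# (base image tag and package name are assumptions)
FROM public.ecr.aws/emr-serverless/spark/emr-6.9.0:latest

USER root
RUN yum install -y java-11-openjdk

# EMR Serverless requires the image to run as the hadoop user
USER hadoop:hadoop
```

After building, push the image to ECR and configure it as the application's custom image so that the `JAVA_HOME` paths in the properties file resolve inside the containers.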