EMR Serverless Configuration

This page discusses how to configure EMR Serverless as an external compute platform.

Page Contents

Mandatory properties:
Optional properties:

To run Spark jobs on EMR Serverless, specific properties must be provided in a properties file. Pass this file with the -c or --compute option in the CLI, as described in the virtual graph materialization and entity resolution sections of external compute.

The following properties must be present in the properties file.

Mandatory properties:

Property	Description	Example
stardog.external.compute.platform	Set this property to the name of the external compute platform.	`emr-serverless`
stardog.external.aws.region	The AWS region where EMR Studio and the application are hosted.	`us-east-1`
stardog.external.aws.access.key	AWS Access Key of the temporary credentials.	`ASIXXXXXXXXX4X`
stardog.external.aws.secret.key	AWS Secret Key of the temporary credentials.	`kgweBpwG/CS9j1yTm0AxY7KZ04wRRYrg+3pt8rek`
stardog.external.aws.session.token	AWS Session Token of the temporary credentials.	`FwoGZXIvYXdzEE4aDGJNU`
stardog.external.emr-serverless.application.id	Application ID of the EMR Serverless application.	`00fa02fbl2qujv90`
stardog.external.emr-serverless.execution.role.arn	The role that has access to emr-serverless resources. This should be the same role that is used to generate the temporary credentials.	`arn:aws:iam::626720997297:role/emraccess`
stardog.host.url	Stardog URL to which the Spark job on EMR Serverless should connect back to write the results. The URL should point to the same Stardog server from which the external compute operation is triggered.	`https://myhost.stardog.cloud:5820`
spark.executorEnv.JAVA_HOME	Mandatory because the latest version of EMR Serverless does not support Java 11, which the Stardog Spark connector requires. Set the value to the path where Java is installed in the custom image.	`/usr/lib/jvm/java-11-openjdk-11.0.18.0.10-1.amzn2.0.1.x86_64`
spark.emr-serverless.driverEnv.JAVA_HOME	Mandatory for the same reason as `spark.executorEnv.JAVA_HOME`: the driver also needs Java 11. Set the value to the path where Java is installed in the custom image.	`/usr/lib/jvm/java-11-openjdk-11.0.18.0.10-1.amzn2.0.1.x86_64`
stardog.external.jar.path	Path to the released `stardog-spark-connector` jar file. The value should point either to Stardog’s public S3 bucket, where the latest released version is available, or to any other S3 bucket where the user has placed the jar. Download the latest jar from this link.	`s3://stardog-spark/stardog-spark-connector-3.2.0.jar`
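
As a rough illustration of how the mandatory properties fit together, the sketch below assembles a sample properties file from the example values in the table and checks that no mandatory key is missing. The helper names (`parse_properties`, `missing_mandatory`) are hypothetical and not part of Stardog. The parser assumes plain `key=value` lines only, which is a simplification of the full Java properties format.

```python
# Mandatory keys, copied from the table above.
MANDATORY_KEYS = [
    "stardog.external.compute.platform",
    "stardog.external.aws.region",
    "stardog.external.aws.access.key",
    "stardog.external.aws.secret.key",
    "stardog.external.aws.session.token",
    "stardog.external.emr-serverless.application.id",
    "stardog.external.emr-serverless.execution.role.arn",
    "stardog.host.url",
    "spark.executorEnv.JAVA_HOME",
    "spark.emr-serverless.driverEnv.JAVA_HOME",
    "stardog.external.jar.path",
]

# Sample file content built from the example values in the table.
# The credentials are the placeholder examples, not real ones.
SAMPLE = """\
# EMR Serverless external compute properties (sample)
stardog.external.compute.platform=emr-serverless
stardog.external.aws.region=us-east-1
stardog.external.aws.access.key=ASIXXXXXXXXX4X
stardog.external.aws.secret.key=kgweBpwG/CS9j1yTm0AxY7KZ04wRRYrg+3pt8rek
stardog.external.aws.session.token=FwoGZXIvYXdzEE4aDGJNU
stardog.external.emr-serverless.application.id=00fa02fbl2qujv90
stardog.external.emr-serverless.execution.role.arn=arn:aws:iam::626720997297:role/emraccess
stardog.host.url=https://myhost.stardog.cloud:5820
spark.executorEnv.JAVA_HOME=/usr/lib/jvm/java-11-openjdk-11.0.18.0.10-1.amzn2.0.1.x86_64
spark.emr-serverless.driverEnv.JAVA_HOME=/usr/lib/jvm/java-11-openjdk-11.0.18.0.10-1.amzn2.0.1.x86_64
stardog.external.jar.path=s3://stardog-spark/stardog-spark-connector-3.2.0.jar
"""


def parse_properties(text):
    """Parse simple `key=value` lines, skipping blanks and # or ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_mandatory(props):
    """Return the mandatory keys that are absent or empty, in table order."""
    return [key for key in MANDATORY_KEYS if not props.get(key)]


missing = missing_mandatory(parse_properties(SAMPLE))
print("missing mandatory properties:", missing if missing else "none")
```

A check like this can be run on a properties file before passing it with -c or --compute, so that a missing key is caught before a job is submitted.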

Any Spark-specific configurations can be set in the same properties file.

Optional properties:

Property	Description	Default
spark.dataset.repartition	Refer to the Spark documentation. Set this value to override the default partitioning behavior.

Configure a role that can access emr-serverless resources and the Elastic Container Registry (ECR) where the custom image is deployed. For security and authentication, create temporary credentials using this role and provide them in the properties described in the table above. These credentials consist of the access key, secret key, and session token, and can be generated programmatically using the CLI or SDKs.
In addition, a custom image for Java 11 must be built and configured for the EMR application, as described in the AWS documentation.
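
One way to generate the temporary credentials programmatically is to call AWS STS AssumeRole through an SDK. The sketch below assumes the boto3 SDK and a role whose trust policy allows the caller to assume it. The function names and the session name `stardog-external-compute` are hypothetical. The STS client is passed in as a parameter, so the mapping logic can be exercised without AWS access.

```python
def credentials_to_properties(credentials):
    """Map the `Credentials` dict of an STS AssumeRole response to property keys."""
    return {
        "stardog.external.aws.access.key": credentials["AccessKeyId"],
        "stardog.external.aws.secret.key": credentials["SecretAccessKey"],
        "stardog.external.aws.session.token": credentials["SessionToken"],
    }


def fetch_temporary_properties(sts_client, role_arn,
                               session_name="stardog-external-compute"):
    """Assume the role and return the three credential properties.

    `sts_client` is expected to behave like boto3.client("sts").
    `session_name` is an arbitrary label chosen for this sketch.
    """
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name,
    )
    return credentials_to_properties(response["Credentials"])


def render(props):
    """Render a dict as `key=value` lines for the properties file."""
    return "\n".join(f"{key}={value}" for key, value in props.items())


# Usage (requires boto3 and AWS credentials allowed to assume the role):
#   import boto3
#   props = fetch_temporary_properties(
#       boto3.client("sts"),
#       "arn:aws:iam::626720997297:role/emraccess",  # example ARN from the table
#   )
#   print(render(props))
```

Temporary credentials expire, so the three values in the properties file need to be regenerated once the session lapses.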
