This page discusses how to configure EMR Serverless as an external compute platform.
To run Spark jobs on EMR Serverless, specific properties must be provided in a properties file. This file can be passed via the --compute option in the CLI, as described in the virtual graph materialization and entity resolution sections of external compute.
The following properties are mandatory in the properties file.
| Property | Description |
| --- | --- |
| `stardog.external.compute.platform` | Set this property to the name of the external compute platform. |
| `stardog.external.aws.region` | The AWS region where EMR Studio and the application are hosted. |
| `stardog.external.aws.access.key` | AWS access key of the temporary credentials. |
| `stardog.external.aws.secret.key` | AWS secret key of the temporary credentials. |
| `stardog.external.aws.session.token` | AWS session token of the temporary credentials. |
| `stardog.external.emr-serverless.application.id` | Application ID of the EMR Serverless application. |
| `stardog.external.emr-serverless.execution.role.arn` | The role that has access to EMR Serverless resources. This should be the same role used to generate the temporary credentials. |
| `stardog.host.url` | Stardog URL to which EMR Serverless should connect back to write the results. The URL should point to the same Stardog server from which the external compute operation is triggered. |
| `spark.executorEnv.JAVA_HOME` | Mandatory because the latest version of EMR Serverless does not support Java 11, which the Stardog Spark connector requires. The value should be the path where Java is installed in the custom image. |
| `spark.emr-serverless.driverEnv.JAVA_HOME` | Mandatory for the same reason as `spark.executorEnv.JAVA_HOME`, but applied to the driver. The value should be the path where Java is installed in the custom image. |
| `stardog.external.jar.path` | Path of the released Stardog Spark connector jar. Download the latest jar from this link. |
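As an illustration, a minimal properties file covering these mandatory settings might look like the following sketch. All IDs, ARNs, hosts, and paths are placeholders, and the platform name and Java path are assumptions based on the descriptions above:

```properties
# Hypothetical example values; substitute your own.
stardog.external.compute.platform=emr-serverless
stardog.external.aws.region=us-east-1
stardog.external.aws.access.key=<ACCESS_KEY>
stardog.external.aws.secret.key=<SECRET_KEY>
stardog.external.aws.session.token=<SESSION_TOKEN>
stardog.external.emr-serverless.application.id=<APPLICATION_ID>
stardog.external.emr-serverless.execution.role.arn=arn:aws:iam::<ACCOUNT_ID>:role/<ROLE_NAME>
stardog.host.url=http://<stardog-host>:5820
# Must match the Java 11 install path inside the custom image.
spark.executorEnv.JAVA_HOME=/usr/lib/jvm/java-11
spark.emr-serverless.driverEnv.JAVA_HOME=/usr/lib/jvm/java-11
stardog.external.jar.path=s3://<bucket>/stardog-spark-connector.jar
```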
If a user needs to provide any Spark-specific configurations, they can be set in the same properties file.
| Property | Description |
| --- | --- |
| `spark.dataset.repartition` | Refer to the Spark docs. Set this value to override the default partitioning behavior. |
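For example, such a Spark tuning property can simply be appended to the same properties file (the value here is an arbitrary illustration, not a recommendation):

```properties
# Optional Spark-specific setting; 200 is an arbitrary example value.
spark.dataset.repartition=200
```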
A new role can be configured with access to EMR Serverless resources and to the Elastic Container Registry (ECR) where the custom image is deployed. To ensure secure authentication, temporary credentials must be created using this role and provided via the properties described in the table above. These credentials include the access key, secret key, and session token, and can be generated programmatically using the AWS CLI or SDKs.
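For instance, temporary credentials can be generated with the AWS CLI's `sts assume-role` command (the role ARN and session name below are placeholders). The returned `AccessKeyId`, `SecretAccessKey`, and `SessionToken` map to the three credential properties above:

```shell
# Assume the execution role; the ARN and session name are placeholders.
# Prints the access key, secret key, and session token on one line.
aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/emr-serverless-exec-role \
  --role-session-name stardog-external-compute \
  --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
  --output text
```

Note that these credentials expire after the session duration, so long-running pipelines should regenerate them before each run.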
In addition, a custom image containing Java 11 must be built and configured for the EMR application, as described in the AWS documentation.
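A sketch of such a custom image, assuming an EMR Serverless Spark base image and the Amazon Corretto 11 package (the base image tag and install path are assumptions; consult the AWS documentation for the exact steps for your release):

```dockerfile
# Base image tag is an example; pick the EMR release you use.
FROM public.ecr.aws/emr-serverless/spark/emr-6.9.0:latest

USER root
# Install Java 11 (Amazon Corretto).
RUN yum install -y java-11-amazon-corretto

# This path must match spark.executorEnv.JAVA_HOME and
# spark.emr-serverless.driverEnv.JAVA_HOME in the properties file.
ENV JAVA_HOME=/usr/lib/jvm/java-11-amazon-corretto.x86_64

# EMR Serverless images are expected to run as the hadoop user.
USER hadoop:hadoop
```

After building, push the image to the ECR repository the role can access, and configure the EMR Serverless application to use it.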