
Virtual Graph Materialization

This page discusses how to materialize a virtual graph using an external compute platform.

Page Contents
  1. Overview
  2. Options
  3. virtual-import
  4. SPARQL update queries
  5. cache-create
  6. Spark Job details

You can materialize a virtual graph using the virtual-import CLI command or SPARQL update queries (ADD/COPY).

The supported virtual graph data sources are:

  1. JDBC data sources supported in Stardog
  2. CSV files accessible from the external compute platform
  3. Spark tables

To use a JDBC data source, establish connectivity between the external compute platform and the data source, and configure the required JDBC jars.
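
How the JDBC driver jar is made available is platform-specific and handled outside Stardog. The commands below are an illustration only, not Stardog requirements; the driver jar, cluster ID, DBFS path, and S3 bucket are hypothetical placeholders.

# Databricks (legacy CLI): upload the driver and install it as a cluster library
$ databricks fs cp postgresql-42.6.0.jar dbfs:/FileStore/jars/postgresql-42.6.0.jar
$ databricks libraries install --cluster-id 0123-456789-abcdefgh --jar dbfs:/FileStore/jars/postgresql-42.6.0.jar

# EMR Serverless: stage the driver jar in S3
$ aws s3 cp postgresql-42.6.0.jar s3://my-bucket/jars/postgresql-42.6.0.jar

For emr-serverless, the staged jar would additionally need to be referenced from the job's Spark configuration (e.g., via spark.jars); consult your platform's documentation.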

Overview

The materialization operation of the virtual graph is converted into a Spark job. This Spark job is created and triggered on the external compute platform. The Spark job depends on the stardog-spark-connector jar. If this jar is not present on the external compute platform, the Stardog server uploads the latest compatible version of stardog-spark-connector.jar based on the configured options. The options vary by external compute platform; refer to the corresponding sections below.

The Spark job then connects to the source, reads the data, applies the virtual graph mappings, and writes the resulting triples back into the Stardog database.
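
Once the job completes, you can verify that the triples were written back, for example with a quick count query (myDB below is a placeholder database name; adjust the graph pattern if you materialized into a named graph):

$ stardog query myDB "SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o }"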

Options

compute
Name of the data source registered as an external compute platform (for databricks), or the absolute path to a properties file (for emr-serverless).

partition-column
This option is only valid for JDBC data sources and accepts a single column name. If there is more than one mapping in the virtual graph (or in the provided mapping file) and the mappings refer to different column names, try one of these options:

1) Ensure all the tables used in the mappings have a column with the same name, e.g., ID.
2) Create multiple virtual graphs so that each mapping contains only one table, and run multiple external compute operations (see the example after this list).

The partition column should be numeric and its values should be evenly distributed. Refer to the Spark documentation for more details on partition columns.
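
For example, if the mappings span two tables keyed by different columns, one way to apply option 2 is to split them into per-table mapping files and run one import per mapping. The mapping file and column names below are hypothetical; the command syntax is the same as in the examples that follow.

$ stardog-admin virtual import --compute myDatabricks -s myVGSource myDB orders-mapping.sms --partition-column ORDER_ID
$ stardog-admin virtual import --compute myDatabricks -s myVGSource myDB customers-mapping.sms --partition-column CUSTOMER_ID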

virtual-import

A new option, compute, is added to the virtual-import command. When this option is present, the materialization workload is pushed to the external compute platform. For databricks, set the option to the name of the data source registered as an external compute platform (via its data source options); for emr-serverless, set it to the path of a properties file containing the compute options.

To materialize a Virtual Graph using the JDBC source

$ stardog-admin virtual import --compute myDatabricks -s myVGSource myDB myVGMapping.sms


$ stardog-admin virtual import --compute <path-to>/emr-serverless-config.properties -s myVGSource myDB myVGMapping.sms

To materialize a Virtual Graph using a JDBC source with a partition column

$ stardog-admin virtual import --compute myDatabricks -s myVGSource myDB myVGMapping.sms --partition-column ID


$ stardog-admin virtual import --compute <path-to>/emr-serverless-config.properties -s myVGSource myDB myVGMapping.sms --partition-column ID

To materialize a Virtual Graph using a CSV file on an external compute platform

$ stardog-admin virtual import --compute myDatabricks myDB myVGMappings.sms /FileStore/myData.csv


$ stardog-admin virtual import --compute <path-to>/emr-serverless-config.properties myDB myVGMappings.sms /<path-to-s3bucket>/myData.csv
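
The mapping file referenced above describes how CSV columns become triples. A minimal SMS2 sketch, assuming the CSV has id and name header columns (the prefix, class, and property names are hypothetical):

PREFIX : <http://example.com/>

MAPPING
FROM CSV {
}
TO {
  ?person a :Person ;
    :name ?name .
}
WHERE {
  BIND(template("http://example.com/person/{id}") AS ?person)
}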

To materialize Spark tables on the external compute platform

$ stardog-admin virtual import --compute myDatabricks myDB myDatabricks.properties myDatabricks.sms


$ stardog-admin virtual import --compute <path-to>/emr-serverless-config.properties myDB myDatabricks.properties myDatabricks.sms

Example of the properties file (myDatabricks.properties above):

jdbc.url=jdbc:spark://adb-XXXXXX.XX.azuredatabricks.net:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath=/sql/1.0/endpoints/XXXXXXX;
jdbc.username=xxxx
jdbc.password=xxxxx
jdbc.driver=com.simba.spark.jdbc.Driver
sql.schemas=*
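
For emr-serverless, the file passed to --compute carries the compute platform settings rather than data source credentials. A hedged sketch of emr-serverless-config.properties, using the same property keys that appear in the query hint examples below (all values are placeholders):

stardog.external.compute.platform=emr-serverless
stardog.external.aws.region=us-east-2
stardog.external.aws.access.key=<access-key>
stardog.external.aws.secret.key=<secret-key>
stardog.external.aws.session.token=<session-token>
stardog.external.emr-serverless.application.id=<application-id>
stardog.external.emr-serverless.execution.role.arn=<execution-role-arn>
stardog.host.url=http://<stardog-host>:5820
stardog.external.jar.path=s3://<bucket>/stardog-spark-connector-3.1.0.jar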

SPARQL update queries

External compute options can be passed as query hints in SPARQL update (ADD/COPY) queries. These hints are suggestions to the query optimizer so that the query engine pushes ADD/COPY operations for virtual graphs to an external compute platform.

To add the virtual graph using external compute

#pragma external.compute.name myDatabricks
ADD <virtual://MyVg> TO :OtherNamedGraph

#pragma external.compute.props stardog.external.compute.platform=emr-serverless;stardog.external.aws.region=us-east-2;stardog.external.aws.access.key=ASIXXXXXXXXXXNP;stardog.external.aws.secret.key=KZQA205ma/pYtLoSxIWrL9oEwX/ut0LAPfAOIFs+;stardog.external.aws.session.token=FwoGZXIvYXdzEFwaDEeDmjJwi;stardog.external.emr-serverless.application.id=00fa02fbl2qujv09;stardog.external.emr-serverless.execution.role.arn=arn:aws:iam::626720994556:role/emraccess;stardog.host.url=http://ec2-44-204-114-111.compute-1.amazonaws.com:5820;stardog.external.jar.path=s3://stardog-spark/stardog-spark-connector-3.1.0.jar
ADD <virtual://MyVg> TO :OtherNamedGraph

To copy the virtual graph using external compute

#pragma external.compute.name myDatabricks
COPY <virtual://MyVg> TO :OtherNamedGraph

#pragma external.compute.props stardog.external.compute.platform=emr-serverless;stardog.external.aws.region=us-east-2;stardog.external.aws.access.key=ASIAZD233NYFSCZSMANP;stardog.external.aws.secret.key=KZQA205ma/pYtLoSxIWrL9oEwX/ut0LAPfAOIFs+;stardog.external.aws.session.token=FwoGZXIvYXdzEFwaDEeDmjJwi;stardog.external.emr-serverless.application.id=00fa02fbl2qujv09;stardog.external.emr-serverless.execution.role.arn=arn:aws:iam::626720994556:role/emraccess;stardog.host.url=http://ec2-44-204-114-111.compute-1.amazonaws.com:5820;stardog.external.jar.path=s3://stardog-spark/stardog-spark-connector-3.1.0.jar
COPY <virtual://MyVg> TO :OtherNamedGraph

To copy the virtual graph using external compute along with the partition column option

#pragma external.compute.name myDatabricks
#pragma external.compute.partition.column ID
COPY <virtual://MyVg> TO :OtherNamedGraph
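
For emr-serverless, the partition column hint can presumably be combined with the properties hint in the same way; a hedged sketch (the AWS credential properties shown in the earlier examples are omitted for brevity, and all values are placeholders):

#pragma external.compute.props stardog.external.compute.platform=emr-serverless;stardog.external.emr-serverless.application.id=<application-id>;stardog.external.emr-serverless.execution.role.arn=<execution-role-arn>;stardog.host.url=<stardog-url>;stardog.external.jar.path=<s3-path-to-connector-jar>
#pragma external.compute.partition.column ID
COPY <virtual://MyVg> TO :OtherNamedGraph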

cache-create

A new option, compute, is added to the cache-create command. When this option is present, the cache creation workload is pushed to the external compute platform. As with virtual-import, for databricks set the option to the name of the data source registered as an external compute platform, and for emr-serverless set it to the path of a properties file containing the compute options.

The cache target cannot be set to localhost.
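
The target must already be registered as a cache target. A hedged sketch of registering one (hostname and credentials are placeholders; check stardog-admin help cache target for the exact arguments):

$ stardog-admin cache target add myTarget cache1.example.com 5820 admin <password>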

To create a cache using external compute

$ stardog-admin cache create cache://mycache --graph virtual://myVG --target myTarget --compute myDatabricks


$ stardog-admin cache create cache://mycache --graph virtual://myVG --target myTarget --compute <path-to>/emr-serverless-config.properties

To create a cache using external compute with a partition column (only when the virtual graph source is JDBC)

$ stardog-admin cache create cache://mycache --graph virtual://myVG --target myTarget --compute myDatabricks --partition-column ID


$ stardog-admin cache create cache://mycache --graph virtual://myVG --target myTarget --compute <path-to>/emr-serverless-config.properties --partition-column ID

To refresh the cache

$ stardog-admin cache refresh cache://mycache

The refresh-script option is not supported with caches that were created using external compute.

Spark Job details

As discussed in the Overview, How it Works, and Architecture sections, Stardog creates a new Spark job for each external compute-supported operation it executes. This section describes the details of that Spark job.

The name of the Spark job has the format Stardog<OperationName><VirtualGraphName><Timestamp>.
E.g., in StardogMaterializationMyVG11668700425701, Materialization is the operation name and MyVG is the name of the virtual graph, followed by the timestamp at which Stardog created the job.

The external compute platform lists the Spark jobs created by Stardog under its workflow/job management. E.g., in the case of Databricks, these jobs are visible under Workflows as shown:

spark jobs

For emr-serverless, the jobs are listed in EMR Studio’s application as shown:

emrserverless jobs


Each Spark job created by Stardog has only one task, with the same name as the job. E.g., in the case of Databricks, the task is visible under the job as shown:

spark task

For emr-serverless, click on the job to view its details.

spark task

Stardog triggers multiple runs of the created Spark job when there is more than one mapping in the virtual graph or the provided mapping file. E.g., in the case of Databricks, the multiple runs are visible as shown:

spark multiple run



Stardog converts the virtual graph materialization configuration (source details, mappings, and Stardog connection details) into job parameters and passes them to the Spark job at runtime. These job parameters are generated automatically by Stardog.

List of job parameters:

jdbc.url: URL of the JDBC data source. Only passed when the source of the virtual graph is JDBC.

jdbc.username: Data source user configured in Stardog. The Spark job uses this user name to authenticate to the JDBC database. Only passed when the source of the virtual graph is JDBC.

jdbc.password: Data source password configured in Stardog. Stardog encrypts the password when passing this parameter over a secured HTTPS REST call. The Spark job uses this password to authenticate to the JDBC database. Only passed when the source of the virtual graph is JDBC.

jdbc.driver: JDBC driver class name. Only passed when the source of the virtual graph is JDBC.

stardog.server: URL of the Stardog server from which the external compute operation was triggered. The Spark job connects back to this URL to write the results.

stardog.username: Stardog user who triggered the external compute operation. The Spark job uses this user name to authenticate to the Stardog server.

stardog.password: Auth token of the Stardog user who triggered the external compute operation. Stardog encrypts this token when passing this parameter over a secured HTTPS REST call. The Spark job uses this auth token to authenticate to the Stardog server.

stardog.database: Stardog database name. The Spark job connects to this database to write back the results.

write.ng.iri: IRI of the named graph the results are written to.

write.input.sms: Path of the mapping file uploaded by Stardog. The Spark job applies this mapping to generate the results.

csv.location: Location of the CSV file on DBFS. Only passed when the source of the virtual graph is a CSV file.