This page discusses how to set up the Stardog Spark connector for running graph analytics algorithms.
Stardog supports graph analytics capabilities via integration with Apache Spark. The Stardog Spark connector is compatible with the Spark 3.x API and can be used with any Spark cluster deployed via Azure Databricks, Amazon EMR, or any other means. The connector is compatible with Stardog 7.*.
The Stardog Spark connector pulls data from a Stardog database into Spark, runs the selected graph analytics algorithm, and writes the results back to Stardog. To use this capability you need a running Stardog server (single node or cluster) and a running Spark cluster. For testing purposes you can also run Spark locally on your machine, as explained below, but this option will not work for large-scale analytics.
In order to use the graph analytics capabilities you need to complete the following steps:
- Download the latest Stardog Spark connector
- Set up input parameters for the algorithm
- Submit a Spark job using the connector and the algorithm parameters
We explain these steps in more detail in the following sections.
The input parameters to graph analytics specify information such as the algorithm that will be run, configuration options for the algorithm, Stardog connection parameters, and options that configure how the graph results are saved to Stardog. The input parameters can be written in a Java-style properties file. An example parameters file looks as follows:
# Algorithm parameters
# Stardog connection parameters
# Output parameters
Note that if you need multiline values in a Java properties file, you must end a line with a backslash (\) to indicate that the next line is part of the value.
Not all the parameters shown above are required. More detailed information about input parameters can be found in the Graph Analytics Algorithms section.
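As an illustration of the properties format and the backslash rule, here is a minimal Python sketch that reads such a file. It is not the connector's own parser. It handles only a simplified subset of the Java properties format (`=` as the only separator, `#` or `!` comments, a trailing backslash as a continuation marker, no escape sequences). The sample values are the ones used in the command-line example on this page, and `output.graph` is split across two lines purely to show the continuation.

```python
def parse_properties(text):
    """Parse a simplified subset of the Java properties format into a dict."""
    props = {}
    pending = None  # text of a logical line that is still being continued
    for raw in text.splitlines():
        line = raw.lstrip()  # leading whitespace is dropped, also on continuation lines
        if pending is None and (not line or line[0] in "#!"):
            continue  # blank line or comment
        logical = (pending or "") + line
        if logical.endswith("\\"):
            pending = logical[:-1]  # drop the backslash and keep reading
            continue
        pending = None
        key, _, value = logical.partition("=")
        props[key.strip()] = value.lstrip()
    return props


# Raw string so the backslash reaches the parser unchanged.
EXAMPLE = r"""
# Algorithm parameters
algorithm.name=PageRank
algorithm.iterations=5

# Stardog connection parameters
stardog.server=http://localhost:5820
stardog.database=testDB

# Output parameters
output.property=example:analytics:rank
output.graph=example:\
    analytics:graph
"""

props = parse_properties(EXAMPLE)
for key, value in props.items():
    print(f"{key} -> {value}")
```

The continuation joins the two physical lines into one value, so `output.graph` ends up as `example:analytics:graph` with no embedded line break.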
The input parameters can also be specified as arguments to the graph analytics program. In this case, each ‘key=value’ pair is passed as a separate argument, with the pairs separated by spaces:
algorithm.name=PageRank algorithm.iterations=5 stardog.server=http://localhost:5820 stardog.database=testDB output.property=example:analytics:rank output.graph=example:analytics:graph
The examples below show how parameters are passed for execution.
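To make the argument form concrete, the following Python sketch splits a list of ‘key=value’ arguments into a dictionary. It only illustrates the format described above and is not how the connector itself reads its arguments; the function name `parse_cli_params` is hypothetical.

```python
def parse_cli_params(args):
    """Turn ['key=value', ...] into a dict, splitting on the first '=' only."""
    params = {}
    for arg in args:
        key, sep, value = arg.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {arg!r}")
        params[key] = value
    return params


# The argument string from the example above, split on whitespace
# the way a shell would hand it to the program.
ARGS = (
    "algorithm.name=PageRank algorithm.iterations=5 "
    "stardog.server=http://localhost:5820 stardog.database=testDB "
    "output.property=example:analytics:rank output.graph=example:analytics:graph"
).split()

params = parse_cli_params(ARGS)
for key, value in params.items():
    print(f"{key} -> {value}")
```

Because the pairs are separated by spaces, a value that itself contains whitespace would need to be quoted in the shell so that it arrives as a single argument.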
We provide basic instructions for running graph analytics from the command line, in a Databricks environment, and on Amazon EMR. If you are using a different Spark installation, you should be able to use these steps as a guide for submitting jobs.
If you have downloaded Apache Spark locally, you can use the standard spark-submit command to run the graph analytics algorithms. In a console, navigate to the Apache Spark directory and execute the following command with the input parameters file you created:
$ bin/spark-submit --master local[*] --files example.properties <path-to-connector>/stardog-spark-connector-VERSION.jar example.properties
local[*] means the job is submitted to a local Spark instance with as many worker threads as there are logical cores on your machine. You can change the number of threads or use a remote Spark cluster location; please refer to the Spark documentation for details. <path-to-connector> should point to the directory where you downloaded the Stardog Spark connector, and VERSION should be replaced by the version you downloaded.
If the input parameters are specified on the command line instead of in a file, the command looks as follows:
$ bin/spark-submit --master local <path-to-connector>/stardog-spark-connector-VERSION.jar algorithm.name=PageRank stardog.server=http://localhost:5820 stardog.database=testDB output.property=example:analytics:rank output.graph=example:analytics:graph
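If you launch jobs from a script, the two command forms above can be assembled programmatically. The Python sketch below only builds and prints the command; it does not run it. The function name `build_spark_submit` and the `/opt/stardog` directory are hypothetical, and VERSION remains a placeholder for the version you downloaded.

```python
import shlex


def build_spark_submit(connector_jar, params=None, properties_file=None,
                       master="local[*]"):
    """Build the spark-submit argument list for either way of passing parameters."""
    cmd = ["bin/spark-submit", "--master", master]
    if properties_file is not None:
        # File form: ship the file with --files and pass its name to the connector.
        cmd += ["--files", properties_file, connector_jar, properties_file]
    else:
        # Inline form: one key=value argument per parameter.
        cmd += [connector_jar] + [f"{k}={v}" for k, v in (params or {}).items()]
    return cmd


# Hypothetical install location; replace VERSION with the version you downloaded.
JAR = "/opt/stardog/stardog-spark-connector-VERSION.jar"

file_cmd = build_spark_submit(JAR, properties_file="example.properties")
inline_cmd = build_spark_submit(
    JAR,
    params={
        "algorithm.name": "PageRank",
        "stardog.server": "http://localhost:5820",
        "stardog.database": "testDB",
    },
    master="local",
)

# shlex.join quotes arguments such as local[*] that a shell might try to expand.
print(shlex.join(file_cmd))
print(shlex.join(inline_cmd))
```

The resulting list can be passed to `subprocess.run(cmd, cwd=<Spark directory>)` as is. A list avoids shell quoting problems altogether, which matters in shells that treat `[*]` as a glob pattern.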
You can use Stardog graph analytics in Databricks Runtime 7.0 or later, which supports Apache Spark 3.0. Make sure that the Spark cluster is launched with a compatible runtime.
Follow the regular steps to create a job and assign a name to the new job.
In the Job settings, click the “Set JAR” option.
In the “Upload JAR to Run” window that pops up, select the Stardog Spark connector jar file you downloaded, enter “com.stardog.spark.GraphAnalytics” as the “Main class”, and enter the input parameters as key-value pairs.
Once you click “OK”, the job is created and you can run the analytics algorithm by clicking the “Run Now” button.
After the job is created you can go back and edit the parameters by clicking the “Edit” button.
You can use Stardog graph analytics with Amazon EMR 6.1.0 or later, which supports Apache Spark 3.0. Make sure that the EMR environment is launched with a compatible release.
You need to upload the Stardog Spark connector you downloaded to an S3 bucket that is accessible by your EMR cluster. Please follow the AWS instructions to upload the jar to a bucket.
In the EMR console, go to the “Steps” tab and click the “Add step” button.
In the “Add step” dialog, select “Spark application” as the “Step type”, enter a descriptive step name, select the S3 location the connector was uploaded to, and enter the input parameters as key-value pairs.
Once you click the “Add” button in this dialog, the graph analytics algorithm will start running immediately.
If you would like to rerun the algorithm, you can select the step from the list and click “Clone step”. In the dialog that pops up you can edit the input parameters if you would like and run the step again.
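The same step can also be added from a script instead of the console. The Python sketch below builds a step definition in the shape accepted by the boto3 EMR client's `add_job_flow_steps` call. This is an assumption about how you might automate the console steps above, not something the connector requires. The bucket name, step name, and cluster deploy mode are hypothetical choices, and the code only constructs the dictionary without contacting AWS.

```python
def build_emr_step(name, connector_s3_uri, params):
    """Build an EMR step that runs the connector through spark-submit."""
    args = ["spark-submit", "--deploy-mode", "cluster", connector_s3_uri]
    args += [f"{key}={value}" for key, value in params.items()]
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",  # keep the cluster running if the step fails
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # EMR's generic command launcher
            "Args": args,
        },
    }


step = build_emr_step(
    name="Stardog PageRank",  # hypothetical step name
    # Hypothetical bucket; VERSION is the connector version you uploaded.
    connector_s3_uri="s3://my-bucket/stardog-spark-connector-VERSION.jar",
    params={
        "algorithm.name": "PageRank",
        "stardog.server": "http://localhost:5820",
        "stardog.database": "testDB",
        "output.property": "example:analytics:rank",
    },
)
print(step)

# With boto3 installed and AWS credentials configured, the step could be submitted as:
#   boto3.client("emr").add_job_flow_steps(JobFlowId="<cluster-id>", Steps=[step])
```

On a real cluster, `stardog.server` must be an address that the EMR nodes can reach; `localhost` is kept here only to match the examples above.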