Usage
This page discusses how to use the Stardog Spark connector for running graph analytics algorithms.
We provide basic instructions to run graph analytics using command line, Databricks environment, and Amazon EMR, taught through a motivating example. If you are using a different Spark installation, you should be able to use these steps as guides to submit jobs.
If you have not read the Setup page, please do so before proceeding. All files used throughout this tutorial can be found in the stardog-examples GitHub repository.
Setup
The data for this example are very simple. The use case is a collection of routers and their connections. The CSV file routers.csv contains an edge list of the connections between the routers. The connections are uni-directional. Each router belongs to one of two classes: Local or Regional. The regional routers typically have more out-edges than the local ones; we can think of them as major junction points for the network.
In this tutorial, you will learn:
- How to perform a graph analytics algorithm on the database to identify connected components in the graph.
- How to create different ontological schemas for the same data.
- How to run the graph analytics algorithms with different schemas and how the results differ.
Creating the database
The following script builds the demo and runs a couple of test queries. If you are running the example against an instance other than localhost with the default username and password (admin, admin), you will need to add parameters for the server, username, and password. For example, line 3 becomes stardog-admin --server "https://my-stardog-instance.com:5820" db drop -u myUsername -p myPassword router
01: #! /bin/bash
03: stardog-admin db drop router
04: stardog-admin db create -n router
07: stardog namespace add --prefix net --uri http://routers.stardog.com/ router
09: stardog-admin virtual import router scope.sms routers_scope.csv
10: stardog-admin virtual import router routers.sms routers.csv
12: stardog data add router -g net:basic basic.ttl
13: stardog data add router -g net:onto onto.ttl
14: stardog data add router -g net:sym onto_symmetric.ttl
16: stardog reasoning schema --add basic --graphs net:basic -- router
17: stardog reasoning schema --add onto --graphs net:onto -- router
18: stardog reasoning schema --add sym --graphs net:sym -- router
20: stardog query router "select (count(*) as ?n) {?s ?p ?o .}"
23: echo 'stardog query --schema basic router "select * { net:r_465 net:connects ?o .}"'
24: stardog query --schema basic router "select * { net:r_465 net:connects ?o .}"
26: echo 'stardog query --schema sym router "select * { net:r_465 net:connects ?o .}"'
27: stardog query --schema sym router "select * { net:r_465 net:connects ?o .}"
- Lines 3-4. Drop and create database router.
- Line 7. Add a namespace.
- Lines 9-10. Import the two data files and their mappings (scope.sms, routers_scope.csv, routers.sms, routers.csv).
- Lines 12-14. Import 3 different ontologies (basic.ttl, onto.ttl, onto_symmetric.ttl) into 3 named graphs (a quick check that these graphs loaded is shown after this list).
- Lines 16-18. Create 3 schemas from these named graphs.
- Line 24. Find r_465’s connections using the basic schema.
- Line 27. Find r_465’s connections using the sym schema.
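As a quick sanity check (not part of the build script), you can count the triples in each named graph to confirm that the three ontology graphs were created:

stardog query router "select ?g (count(*) as ?n) { graph ?g { ?s ?p ?o } } group by ?g"

You should see net:basic, net:onto, and net:sym (expanded to their full IRIs) in the results.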
Three Ontologies
Turtle files basic.ttl, onto.ttl, and onto_symmetric.ttl contain three different ontologies for the router data. basic defines properties and classes. onto further indicates that Regional and Local are subclasses of a common class, Router. sym provides a rule to make the connection edges symmetric.
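One way to see the effect of the onto schema's subclass axioms is to ask for instances of the common class with reasoning enabled (this query is only an illustrative check; the limit is arbitrary):

stardog query --schema onto router "select ?r { ?r a net:Router } limit 5"

With the onto schema, both Regional and Local routers are returned as instances of net:Router; with the basic schema the same query would likely return nothing, since the data only types routers as Local or Regional.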
The queries in lines 24 and 27 of the build file illustrate the difference that symmetry makes. Router net:r_465 connects to 4 routers when ignoring edge direction and only 2 when respecting edge direction.
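You can reproduce this comparison with two count queries, one per schema; given the numbers above, the first should return 2 and the second 4:

stardog query --schema basic router "select (count(distinct ?o) as ?n) { net:r_465 net:connects ?o }"
stardog query --schema sym router "select (count(distinct ?o) as ?n) { net:r_465 net:connects ?o }"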
Command Line
If you download the Apache Spark library locally, you can use the standard spark-submit command to run the graph analytics algorithms. In the console, navigate to the Apache Spark directory and execute the following command:
$ bin/spark-submit --master local[*] --files router.properties <path-to-connector>/stardog-spark-connector-VERSION.jar router.properties
The argument local[*] means the job is being submitted to a local Spark cluster with as many worker threads as there are logical cores on your machine. You can change the number of threads or use a remote Spark cluster location; refer to the Spark documentation for details. The <path-to-connector> should point to the directory where you downloaded the Stardog Spark connector, and VERSION should be replaced by the version you downloaded. The properties file contains the parameters needed to specify the jobs. Here is a sample properties file.
A sample properties file
01: # Algorithm parameters
02: # algorithm.name=ConnectedComponents
03: # algorithm.name=LabelPropagation
04: # algorithm.name=PageRank
05: algorithm.name=StronglyConnectedComponents
06: # algorithm.name=TriangleCount
07: algorithm.iterations=10
09: # Stardog connection parameters
10: stardog.server=http://localhost:5820
11: stardog.database=router
12: stardog.username=admin
13: stardog.password=admin
14: stardog.query.timeout=10m
15: #stardog.reasoning=true
16: stardog.reasoning.schema=sym
17: #stardog.query=construct {?s ?p ?o .} from <some:graph> where {?s ?p ?o .}
19: # Output parameters
20: output.property=http://routers.stardog.com/sym/component
21: output.graph=http://routers.stardog.com/sym
23: # Spark parameters
24: spark.dataset.size=12000
You can find this file here.
- Lines 1-7 give the available GraphX algorithms. For a brief description of what they do, see Algorithms. This tutorial uses strongly connected components.
- Lines 9-14. The endpoint of the Stardog instance, the database name, the username and password, and the query timeout.
- Line 15. If the query needs reasoning ON, set it here.
- Line 16. If using a named schema, name it here. If stardog.reasoning.schema is not null, reasoning is assumed and there is no need to set stardog.reasoning.
- Line 17. By default, the algorithm is applied to the ENTIRE graph, including any ontology pieces you may have thrown in there. The query allows you to limit the scope of the algorithm. Note that it is a CONSTRUCT query, rather than the usual SELECT. Note also that it is not surrounded by quotes, even though it is a string.
- Lines 20-21. The results of the analysis are added to a named graph. With strongly connected components, each node from the query is assigned a component label, using the datatype property defined in line 20 (an example query over the output is shown after this list).
- Line 24. spark.dataset.size is the approximate number of triples in the graph. Obtain this number by running the standard query, select (count(*) as ?n) {?s ?p ?o .}. See what other Spark parameters are available here. In most cases, the defaults are fine.
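Once the job has run, you can inspect the results written by the sample configuration above. The graph and property IRIs below are the ones set on lines 20-21; the limit is arbitrary:

stardog query router "select ?router ?component { graph <http://routers.stardog.com/sym> { ?router <http://routers.stardog.com/sym/component> ?component } } limit 10"

Each router returned is paired with the identifier of the component it was assigned to.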
Command line all the way!
You can specify these parameters on the command line, dispensing with the properties file, as follows:
$ bin/spark-submit --master local[8] <path-to-connector>/stardog-spark-connector-VERSION.jar algorithm.name=StronglyConnectedComponents algorithm.iterations=10 stardog.server=http://localhost:5820 stardog.database=router output.property=http://routers.stardog.com/sym/component output.graph=http://routers.stardog.com/sym
Using schemas
You can experiment with the router schemas and the strongly connected components algorithm.
The onto schema allows us to treat Regional and Local routers as entities of class Router. The relevant lines from the properties file are:
stardog.reasoning.schema=onto
stardog.query=construct {?r1 ?p ?r2 .} where {?r1 a net:Router ; ?p ?r2 . ?r2 a net:Router . }
# Output parameters
output.property=http://routers.stardog.com/ontoComp/component
output.graph=http://routers.stardog.com/ontoComp
Next, we can run the algorithm with the symmetric ontology, sym, which allows connections to run both ways. The data are saved to a different named graph.
stardog.reasoning.schema=sym
stardog.query=construct {?r1 ?p ?r2 .} where {?r1 a net:Router ; ?p ?r2 . ?r2 a net:Router . }
# Output parameters
output.property=http://routers.stardog.com/symComp/component
output.graph=http://routers.stardog.com/symComp
In the first case, we get 2113 distinct components. In other words, each router is its own component. With bi-directional edges (when using the sym ontology), we get only 1 component, as the graph is connected when directionality is ignored.
To check this result, run
stardog query router "select (count(distinct(?component)) as ?n) {graph net:symComp {?s ?p ?component}}"
followed by,
stardog query router "select (count(distinct(?component)) as ?n) {graph net:ontoComp {?s ?p ?component}}"
Databricks
You can use Stardog graph analytics in Databricks Runtime 7.0 or later, which supports Apache Spark 3.0. Make sure the Spark cluster is launched with a compatible runtime.
Graph analytics can be run from Databricks, alongside the other utilities in the Spark connector. The two main ways are:
- Run from a notebook.
- Create a job to run the notebook if you want this to be part of a larger workflow or to occur at a scheduled time.
A GraphAnalytics notebook
Add the Spark connector jar to the Databricks workspace (right click in the workspace => create library => upload jar).
Create a Databricks notebook with Scala and add the following code to the cells.
// Databricks notebook source
import com.stardog.spark.GraphAnalytics
val sgServer = "https://solutions-demo.stardog.cloud:5820"
val pw = dbutils.secrets.get("your_scope", "your_password_key")
val userName = dbutils.secrets.get("your_scope", "your_username_key")
val dbName = "router"
// COMMAND ----------
val q = "construct {?r1 ?p ?r2 .} where {?r1 a net:Router; ?p ?r2 . ?r2 a net:Router . }"
// COMMAND ----------
val params = Array(
"algorithm.name=StronglyConnectedComponents",
"algorithm.iterations=5",
"stardog.server=" + sgServer,
"stardog.database=" + dbName,
"stardog.username=" + userName,
"stardog.password=" + pw,
"stardog.query.timeout=10m",
"stardog.reasoning.schema=onto",
"stardog.query=" + q,
"output.property=http://routers.stardog.com/ontoComp/component",
"output.graph=http://routers.stardog.com/ontoComp",
"spark.dataset.size=12000"
)
// COMMAND ----------
GraphAnalytics.main(params)
// COMMAND ----------
This notebook runs the Strongly Connected Components algorithm using the onto ontology.
Run as a Job
Select the Workflows tab from the menu. Select create job. Complete the task page with the following information. Under Path*, insert the path to your notebook. For Dependent libraries, click on the form and you will get the opportunity to load the jar file. The jar uploaded to the workspace library does not work in this context. Now, run the job!
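Once the job completes, one way to confirm that the notebook wrote its results is to count, from the Stardog CLI (assuming it is installed on your machine), how many routers received a component label in the output graph used above:

stardog query router "select (count(distinct ?s) as ?n) { graph net:ontoComp { ?s ?p ?component } }"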
Amazon EMR
You can use Stardog graph analytics with Amazon EMR 6.1.0 or later, which supports Apache Spark 3.0. Make sure the EMR environment is launched with a compatible runtime.
You need to upload the Stardog Spark connector you downloaded to an S3 bucket that is accessible by your EMR cluster. Please follow AWS instructions to upload the jar to a bucket.
In the EMR console, go to the “Steps” tab and click the “Add step” button.
In the “Add step” dialog, select “Spark application” as the “Step type”, enter a descriptive step name, select the S3 location the connector was uploaded to, and enter the input parameters as key-value pairs.
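The parameters are the same key=value pairs used with spark-submit earlier. A minimal sketch for this example is shown below; the server URL is a placeholder (an EMR cluster cannot reach localhost), and the credentials and other values should be adjusted to your setup:

algorithm.name=StronglyConnectedComponents algorithm.iterations=10 stardog.server=https://your-stardog-host:5820 stardog.database=router stardog.username=admin stardog.password=admin stardog.reasoning.schema=sym output.property=http://routers.stardog.com/sym/component output.graph=http://routers.stardog.com/sym spark.dataset.size=12000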
Once you click the “Add” button in this dialog, the graph analytics algorithm will start running immediately.
If you would like to rerun the algorithm, you can select the step from the list and click “Clone step”. In the dialog that pops up, you can edit the input parameters and then run the step again.