
Spark Programs

This page discusses how to write custom Spark applications using the Stardog Spark connector.

Page Contents
  1. Overview
  2. Java Example
  3. Schema Inference

Overview

The Stardog Spark connector allows users to create Spark Dataset instances backed by a Stardog connection and a SELECT query. Once the dataset is created, all regular Spark functionality can be used in custom Spark applications.

Java Example

Creating a Stardog dataset requires setting the connection parameters:

import com.stardog.spark.datasource.StardogSource;
import com.stardog.spark.utils.Options;

import java.util.HashMap;
import java.util.Map;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Set connection parameters for Stardog
Map<String, String> options = new HashMap<>();
options.put(Options.SERVER.getName(), "http://localhost:5820");
options.put(Options.DATABASE.getName(), "testDB");

// The query must be a SELECT query
options.put(Options.QUERY.getName(), "SELECT * { ... } ");

// Create a Spark session
SparkSession spark = ...

// Create a Stardog dataset backed by the query results
Dataset<Row> dataset = spark.read()
                            .format(StardogSource.class.getName())
                            .options(options)
                            .load();
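Once loaded, the dataset behaves like any other Spark dataset, so the usual transformations and actions apply. A brief sketch (the column name "name" is hypothetical and assumes the SELECT query projects a ?name variable):

```java
// Inspect the inferred schema of the dataset
dataset.printSchema();

// Standard Spark actions and transformations work as usual
long count = dataset.count();
Dataset<Row> filtered = dataset.filter(dataset.col("name").isNotNull());
filtered.show(10);
```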

Schema Inference

The Stardog Spark connector analyzes the input SELECT query and tries to infer a schema for the dataset. Schema inference inspects the properties used in the triple patterns of the query and queries the Stardog database to retrieve any rdfs:range triples defined for those properties. If the SELECT query uses variable predicates or complex SPARQL patterns, or the database does not contain range definitions, schema inference falls back to assigning a generic type to the columns. In such cases the user can supply an explicit schema with the DataFrameReader.schema() function before calling load().
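As a sketch of supplying an explicit schema when inference falls back to generic types (the column names and types here are hypothetical, and `spark` and `options` are assumed to be set up as in the Java example above):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Declare the column names and types expected from the SELECT query;
// the fields must match the projected query variables in order
StructType schema = new StructType()
        .add("name", DataTypes.StringType)
        .add("age", DataTypes.LongType);

// Pass the schema to the reader instead of relying on inference
Dataset<Row> dataset = spark.read()
                            .format(StardogSource.class.getName())
                            .options(options)
                            .schema(schema)
                            .load();
```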

RDF terms are mapped to Spark types as follows: IRIs and blank nodes are always mapped to strings on the Spark side, while literals are mapped to built-in Spark types using the correspondence table below.

RDF Type        Spark Type
xsd:string      string
xsd:boolean     boolean
xsd:byte        byte
xsd:short       short
xsd:int         integer
xsd:long        long
xsd:integer     long
xsd:decimal     decimal
xsd:float       float
xsd:double      double
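For illustration, consider a hypothetical :hasAge property whose rdfs:range is declared as xsd:integer (the query, property, and variable names below are made up; `spark` and `options` are assumed to be set up as in the Java example above). Per the table, the inferred column type for ?age would be long:

```java
// A query projecting an IRI-valued variable and a literal-valued variable
options.put(Options.QUERY.getName(),
        "SELECT ?person ?age { ?person :hasAge ?age }");

Dataset<Row> ds = spark.read()
                       .format(StardogSource.class.getName())
                       .options(options)
                       .load();

// Expected inferred schema, if :hasAge has rdfs:range xsd:integer:
//   person -> string (IRIs always map to strings)
//   age    -> long   (xsd:integer maps to long)
ds.printSchema();
```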