Mapping Data Sources

This page discusses mapping data sources in detail to connect to a virtual graph.

Page Contents

Overview
R2RML and SMS (Stardog Mapping Syntax)
SMS2 (Stardog Mapping Syntax 2)

Overview

As discussed in the How Virtual Graphs Work, a data mapping file is one of the three components that make up a Virtual Graph (the others being a unique name and properties file).

Stardog Virtual Graphs supports 3 mapping formats, but not all data sources support all formats. Moreover, one mapping format, SMS2, supports multiple source data models via the FROM clause. Please review the following tables to understand your options.

Relational Databases

Datasource	R2RML & SMS	SMS2	Source Data Model	Functions in Bind Expressions
*	Yes	Yes	SQL	Type casts

NoSQL Databases

Datasource	R2RML & SMS	SMS2	Source Data Model	Functions in Bind Expressions
Cassandra	Yes	Yes	SQL	Type casts
Cosmos DB	No	Yes	JSON/GraphQL	Type casts
MongoDB	No	Yes	JSON/GraphQL	Type casts
Elasticsearch	No	Yes	JSON/GraphQL	Type casts

Static Files

Datasource	R2RML & SMS	SMS2	Source Data Model	Functions in Bind Expressions
CSV	Yes	Yes	CSV	Full support
JSON	No	Yes	JSON/GraphQL	Full Support

Support for type cast functions in BIND expressions means only the template and datatype casting functions are available. Valid datatype casting functions include specific types, eg. xsd:integer(?field), and the generic strdt() function which can also accept a user-defined type. Full support makes all SPARQL functions available.

We recommend using SMS2 for supporting all features.

R2RML and SMS (Stardog Mapping Syntax)

R2RML is the W3C-recommended language for mapping relational databases to RDF. For this reason all SMS mappings, and all SMS2 mappings using FROM SQL, can be converted to R2RML.

The Stardog Mapping Syntax (SMS) is an alternative way to write R2RML mappings that is much simpler to read and write than R2RML.

We will use the example database from the R2RML specification to explain SMS. The SQL schema that corresponds to this example is:

CREATE TABLE "DEPT" (
      "deptno" INTEGER UNIQUE,
      "dname" VARCHAR(30),
      "loc" VARCHAR(100));
INSERT INTO "DEPT" ("deptno", "dname", "loc")
       VALUES (10, 'APPSERVER', 'NEW YORK');

CREATE TABLE "EMP" (
      "empno" INTEGER PRIMARY KEY,
      "ename" VARCHAR(100),
      "job" VARCHAR(30),
      "deptno" INTEGER REFERENCES "DEPT" ("deptno"),
      "etype" VARCHAR(30));
INSERT INTO "EMP" ("empno", "ename", "job", "deptno", "etype" )
       VALUES (7369, 'SMITH', 'CLERK', 10, 'PART_TIME');

Suppose we would like to represent this information in RDF using the same translation for job codes as in the original example:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix emp: <http://example.com/emp/> .
@prefix dept: <http://example.com/dept/> .

dept:10 a dept:Department ;
    dept:location "NEW YORK" ;
    dept:deptno "10"^^xsd:integer .

emp:7369 a emp:Employee ;
    emp:name "SMITH" ;
    emp:role emp:general-office ;
    emp:department dept:10 .

SMS looks very similar to the output RDF representation:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix emp: <http://example.com/emp/> .
@prefix dept: <http://example.com/dept/> .
@prefix sm: <tag:stardog:api:mapping:> .

dept:{"deptno"} a dept:Department ;
    dept:location "{\"loc\"}" ;
    dept:deptno "{\"deptno\"}"^^xsd:integer ;
    sm:map [
      sm:table "DEPT" ;
    ] .

emp:{"empno"} a emp:Employee ;
    emp:name "{\"ename\"}" ;
    emp:role emp:{ROLE} ;
    emp:department dept:{"deptno"} ;
    sm:map [
      sm:query """
        SELECT \"empno\", \"ename\", \"deptno\", (CASE \"job\"
            WHEN 'CLERK' THEN 'general-office'
            WHEN 'NIGHTGUARD' THEN 'security'
            WHEN 'ENGINEER' THEN 'engineering'
        END) AS ROLE FROM \"EMP\"
        """ ;
    ] .

SMS is based on Turtle, but it’s not valid Turtle since it uses the URI templates of R2RML – curly braces can appear in URIs. Other than this difference, we can treat an SMS document as a set of RDF triples. SMS documents use the special namespace tag:stardog:api:mapping: that we will represent with the sm prefix below.

Every subject in the SMS document that has a sm:map property maps a single row from the corresponding table/view to one or more triples. If an existing table/view is being mapped, sm:table is used to refer to the table. Alternatively, a SQL query can be provided inline using the sm:query property.

The values generated will be a URI, blank node, or a literal based on the type of the value used in the mapping. The column names referenced between curly braces will be replaced with the corresponding values from the matching row.

SMS can be translated to the standard R2RML syntax automatically by Stardog. For completeness, we provide the R2RML mappings corresponding to the above example:

@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix emp: <http://example.com/emp#> .
@prefix dept: <http://example.com/dept#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@base <http://example.com/base/> .

<DeptTriplesMap>
    a rr:TriplesMap;
    rr:logicalTable [ rr:tableName "DEPT" ];
    rr:subjectMap [ rr:template "http://data.example.com/dept/{\"deptno\"}" ;
                    rr:class dept:Department ];
    rr:predicateObjectMap [
      rr:predicate      dept:deptno ;
      rr:objectMap    [ rr:column "\"deptno\""; rr:datatype xsd:positiveInteger ]
    ];
    rr:predicateObjectMap [
      rr:predicate    dept:location ;
      rr:objectMap    [ rr:column "\"loc\"" ]
    ].

<EmpTriplesMap>
    a rr:TriplesMap;
    rr:logicalTable [ rr:sqlQuery """
        SELECT "EMP".*, (CASE "job"
            WHEN 'CLERK' THEN 'general-office'
            WHEN 'NIGHTGUARD' THEN 'security'
            WHEN 'ENGINEER' THEN 'engineering'
        END) AS ROLE FROM "EMP"
        """ ];
    rr:subjectMap [
        rr:template "http://data.example.com/employee/{\"empno\"}";
        rr:class emp:Employee
    ];
    rr:predicateObjectMap [
      rr:predicate        emp:name ;
      rr:objectMap    [ rr:column "\"ename\"" ];
    ];
    rr:predicateObjectMap [
        rr:predicate emp:role;
        rr:objectMap [ rr:template "http://data.example.com/roles/{ROLE}" ];
    ];
    rr:predicateObjectMap [
        rr:predicate emp:department;
        rr:objectMap [ rr:template "http://example.com/dept/{\"deptno\"}"; ];
    ].

SMS2 (Stardog Mapping Syntax 2)

Stardog Mapping Syntax 2 (SMS2) is a way to represent virtual graph mappings that is designed to support a broader range of source data formats than R2RML and SMS, including semi-structured data sources such as JSON, MongoDB and Elasticsearch, as well as structured formats like SQL RDBMS.

SMS2 is loosely based on the SPARQL CONSTRUCT query. An abbreviated example looks like this:

PREFIX : <http://stardog.com/movies/>

MAPPING <urn:movies>
FROM JSON {
  "movie":{
    "_id":"?movieId",
    "name":"?name",
  }
}
TO {
  ?movie a :Movie ;
    :name ?name .
}
WHERE {
  BIND (template("http://stardog.com/movies/Title_{movieId}") AS ?movie)
}

SMS2 consists of five parts: PROLOGUE, MAPPING, FROM, TO, and WHERE.

The PROLOGUE is a series of prefix declarations at the beginning of the file.
The MAPPING through WHERE clauses define a mapping and the set of them can be repeated, separated by a semicolon.
1. The MAPPING clause consists of the MAPPING keyword followed by an optional IRI for naming the mapping.
The FROM clause describes the input. The FROM clause starts with the FROM keyword and is followed by a data format keyword. JSON in this example, but it can be CSV, GraphQL , or SQL. It is followed by a definition that describes the structure of the data and assigns fields to variable names.
The TO clause defines how the output RDF should look. It is analogous to the CONSTRUCT portion of the SPARQL CONSTRUCT query. It consists of a set of triples where variables can be used in any position.
The WHERE clause is where you can transform source data and BIND the transformed values to new variables. The currently supported functions for use within BIND are template for IRI construction and the cast functions (xsd:string, xsd:boolean, xsd:integer, xsd:float, xsd:double, xsd:decimal, xsd:dateTime, xsd:date) for literal value type conversion.

Notice there are no platform-specific query elements (such as MongoDB query syntax) present in the mapping, only descriptions of the source and target data schemas and transformations for mapping the relationship between the source and target.

To help illustrate SMS2, we’ll use the following JSON for a movie collection from a MongoDB database:

{
   "_id":"unforgiven",
   "name":"Unforgiven",
   "datePublished":new Date("1992-08-07T00:00:00.000Z"),
   "genre":["Drama", "Western"],
   "boxOffice":101157447,
   "description":"Retired gunslinger reluctantly takes on one last job.",
   "director":[
      {"director":"clintEastwood", "name":"Clint Eastwood"}
   ],
   "actor":[
      {"actor":"morganFreeman", "name":"Morgan Freeman"},
      {"actor":"clintEastwood", "name":"Clint Eastwood"},
      {"actor":"geneHackman", "name":"Gene Hackman"}
   ]
}
{
   "_id":"noWayOut",
   "name":"No Way Out",
   "datePublished":new Date("1987-08-14T00:00:00.000Z"),
   "genre":["Action", "Mystery", "Drama", "Thriller"],
   "boxOffice":35509515,
   "description":"A coverup and witchhunt occur after a politician accidentally kills his mistress.",
   "director":[
      {"director":"rogerDonaldson", "name":"Roger Donaldson"}
   ],
   "actor":[
      {"actor":"geneHackman", "name":"Gene Hackman"},
      {"actor":"kevinCostner", "name":"Kevin Costner"}
   ]
}

For this example we’ll create mappings that represent the data as this RDF:

@prefix : <http://stardog.com/movies/> .

:Title_noWayOut a :Movie ;
    :name "No Way Out" ;
    :datePublished "1987-08-14"^^xsd:date ;
    :boxOffice 35509515 ;
    :description "A coverup and witchhunt occur after a politician accidentally kills his mistress." ;
    :genre "Action", "Mystery", "Drama", "Thriller" ;
    :directed :Job_noWayOut_rogerDonaldson ;
    :actedIn :Job_noWayOut_geneHackman, :Job_noWayOut_kevinCostner .

:Title_unforgiven a :Movie ;
    :name "Unforgiven" ;
    :datePublished "1992-08-07"^^xsd:date ;
    :boxOffice 101157447 ;
    :description "Retired gunslinger reluctantly takes on one last job." ;
    :genre "Drama", "Western" ;
    :directed :Job_unforgiven_clintEastwood ;
    :actedIn :Job_unforgiven_morganFreeman, :Job_unforgiven_clintEastwood, :Job_unforgiven_geneHackman .

:Job_noWayOut_rogerDonaldson a :DirectedMovie ;
    :name "Roger Donaldson" ;
    :director :Name_rogerDonaldson .

:Name_rogerDonaldson a :Person .

:Job_unforgiven_clintEastwood a :DirectedMovie ;
    :name "Clint Eastwood" ;
    :director :Name_clintEastwood .

:Job_unforgiven_clintEastwood a :ActedInMovie ;
    :name "Clint Eastwood" ;
    :actor :Name_clintEastwood .

:Name_clintEastwood a :Person .

:Job_noWayOut_geneHackman a :ActedInMovie ;
    :name "Gene Hackman" ;
    :actor :Name_geneHackman .

:Job_unforgiven_geneHackman a :ActedInMovie ;
    :name "Gene Hackman" ;
    :actor :Name_geneHackman .

:Name_geneHackman a :Person .

:Job_noWayOut_kevinCostner a :ActedInMovie ;
    :name "Kevin Costner" ;
    :actor :Name_kevinCostner .

:Name_kevinCostner a :Person .

:Job_unforgiven_morganFreeman a :ActedInMovie ;
    :name "Morgan Freeman" ;
    :actor :Name_morganFreeman .

:Name_morganFreeman a :Person .

Notice there are many IRIs that contain both Movie and Person ids. These scoped IRIs are redundant in this dataset but they serve a purpose when working with denormalized datasources, which is common in NoSQL databases like MongoDB. In this dataset, the name of a Person can appear in both actor and director objects. The name is repeated for every directing or acting job that Person has had. There is no guarantee that a Person’s name is constant across all their jobs, either because the field reflects the name the person had at the time of the job, or because of a problem during an update that led to the inconsistency. Without IRIs that scope a Person to a specific Movie, when you query for the Person’s name, the correct response is a record for every Person/name pair, which can be an expensive query.

See the blog post Mapping Denormalized Data for more details.

Here is the SMS2 mapping for this exercise:

PREFIX : <http://stardog.com/movies/>
MAPPING <urn:movies>
FROM JSON {
  "movie":{
    "_id":"?movieId",
    "name":"?name",
    "datePublished":"?datePublished",
    "genre":["?genre"],
    "boxOffice":"?boxOffice",
    "description":"?description",
    "director":[ {
        "director":"?directorId",
        "name":"?directorName"
      }
    ],
    "actor":[ {
        "actor":"?actorId",
        "name":"?actorName"
      }
    ]
  }
}
TO {
  ?movie a :Movie ;
    :name ?name ;
    :datePublished ?xsdDatePublished ;
    :genre ?genre ;
    :boxOffice "?boxOffice"^^xsd:integer ;
    :description ?description ;
    :directed ?directedMovie ;
    :actedIn ?actedInMovie .

  ?directedMovie a :DirectedMovie ;
    :director ?director ;
    :name ?directorName .

  ?director a :Person .

  ?actedInMovie a :ActedInMovie ;
    :actor ?actor ;
    :name ?actorName .

  ?actor a :Person .
}
WHERE {
  BIND (template("http://stardog.com/movies/Job_{movieId}_{directorId}") AS ?directedMovie)
  BIND (template("http://stardog.com/movies/Job_{movieId}_{actorId}") AS ?actedInMovie)
  BIND (template("http://stardog.com/movies/Title_{movieId}") AS ?movie)
  BIND (template("http://stardog.com/movies/Name_{directorId}") AS ?director)
  BIND (template("http://stardog.com/movies/Name_{actorId}") AS ?actor)
  BIND (xsd:date(?datePublished) AS ?xsdDatePublished)
}

Details of the various FROM formats (SQL, JSON, GraphQL, CSV) are described below in their respective sections.

FROM SQL

The most common option for the FROM clause is FROM SQL, which is for RDBMS datasources and Cassandra. It differs from the JSON and GraphQL source template formats in that for SQL we provide a query in place of a data description. Stardog will interrogate the database schema to determine the field names (which will become variable names) to use for mapping.

To explain the FROM SQL format, recall the SMS mapping from above:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix emp: <http://example.com/emp/> .
@prefix dept: <http://example.com/dept/> .
@prefix sm: <tag:stardog:api:mapping:> .

dept:{"deptno"} a dept:Department ;
    dept:location "{\"loc\"}" ;
    dept:deptno "{\"deptno\"}"^^xsd:integer ;
    sm:map [
      sm:table "DEPT" ;
    ] .

emp:{"empno"} a emp:Employee ;
    emp:name "{\"ename\"}" ;
    emp:role emp:{ROLE} ;
    emp:department dept:{"deptno"} ;
    sm:map [
      sm:query """
        SELECT \"empno\", \"ename\", \"deptno\", (CASE \"job\"
            WHEN 'CLERK' THEN 'general-office'
            WHEN 'NIGHTGUARD' THEN 'security'
            WHEN 'ENGINEER' THEN 'engineering'
        END) AS ROLE FROM \"EMP\"
        """ ;
    ] .

The SMS2 equivalent of this mapping looks like this:

PREFIX emp: <http://example.com/emp/>
PREFIX dept: <http://example.com/dept/>

MAPPING <urn:departments>
FROM SQL {
  SELECT * FROM "DEPT"
}
TO {
  ?deptIri a dept:Department ;
    dept:location ?loc ;
    dept:deptno "?deptno"^^xsd:integer .
}
WHERE {
  BIND (template("http://example.com/dept/{deptno}") AS ?deptIri)
}

;

MAPPING <urn:employees>
FROM SQL {
  SELECT \"empno\", \"ename\", \"deptno\", (CASE \"job\"
    WHEN 'CLERK' THEN 'general-office'
    WHEN 'NIGHTGUARD' THEN 'security'
    WHEN 'ENGINEER' THEN 'engineering'
  END) AS ROLE FROM \"EMP\"
}
TO {
  ?empIri a emp:Employee ;
    emp:name ?ename ;
    emp:role ?roleIri ;
    emp:department ?deptIri .
}
WHERE {
  BIND (template("http://example.com/emp/{empno}") AS ?empIri)
  BIND (template("http://example.com/dept/{deptno}") AS ?deptIri)
  BIND (template("http://example.com/emp/{ROLE}") AS ?roleIri)
}

Note the use of the semicolon to separate multiple mappings, which were necessary because we needed the two SQL statements.

FROM JSON

The structure of the FROM JSON clause resembles the source JSON structure with some changes:

Values are replaced by variable names.
Arrays contain a single element.
Only one JSON document is supplied.
(MongoDB and Cosmos only) There is an outermost key to indicate the name of the collection (movie).

Fields are interpreted as strings unless given a specific data type, by using a cast function in either the WHERE clause as illustrated in the example with the datePublished field, or directly in the TO clause as illustrated by the boxOffice field.

See the initial example which uses FROM JSON in a mapping file for MongoDB.

FROM GraphQL

The FROM GraphQL definition is an alternative format for hierarchical data. It is a Selection Set consisting of Fields, which can be aliased and can contain nested selection sets. By default, each field will be mapped to a variable with the same name as the field. If the field is aliased the alias will serve as the variable name. To identify an array, use an @array directive.

The following mapping uses a FROM GraphQL clause to produce the same results as our prior example that used a FROM JSON clause.

A noteworthy difference between FROM GraphQL and FROM JSON is the order in which source names are replaced with target names. In FROM JSON you can reference the value associated with each “_id” attribute by specifying "_id":"?movieid", the variable is on the right. In FROM GraphQL you do the same by specifying movieId: _id, the variable is on the left.

PREFIX : <http://stardog.com/movies/>
MAPPING <urn:movies>
FROM GraphQL {
  movie {
    movieId: _id
    name
    datePublished
    genre @array
    boxOffice
    description
    director @array {
        directorId: director
        directorName: name
    }
    actor @array {
        actorId: actor
        actorName: name
    }
  }
}
TO {
  ?movie a :Movie ;
    :name ?name ;
    :datePublished ?xsdDatePublished ;
    :genre ?genre ;
    :boxOffice "?boxOffice"^^xsd:integer ;
    :description ?description ;
    :directed ?directedMovie ;
    :actedIn ?actedInMovie .

  ?directedMovie a :DirectedMovie ;
    :director ?director ;
    :name ?directorName .

  ?director a :Person .

  ?actedInMovie a :ActedInMovie ;
    :actor ?actor ;
    :name ?actorName .

  ?actor a :Person .
}
WHERE {
  BIND (template("http://stardog.com/movies/Job_{movieId}_{directorId}") AS ?directedMovie)
  BIND (template("http://stardog.com/movies/Job_{movieId}_{actorId}") AS ?actedInMovie)
  BIND (template("http://stardog.com/movies/Title_{movieId}") AS ?movie)
  BIND (template("http://stardog.com/movies/Name_{directorId}") AS ?director)
  BIND (template("http://stardog.com/movies/Name_{actorId}") AS ?actor)
  BIND (xsd:date(?datePublished) AS ?xsdDatePublished)
}

Note how an array of primitives like genre has the @array directive while an array of objects has the @array directive followed by a selection set. If we wished to map, say, the genre field to a genres variable, we would use an alias, giving this complete line for the genre field: genres: genre @array.

FROM CSV

FROM CSV is used when importing CSV or TSV delimited files;. There is no content in the FROM CSV clause. Either a set of empty braces can follow the FROM CSV clause, or the braces can be omitted.

Overview
R2RML and SMS (Stardog Mapping Syntax)
SMS2 (Stardog Mapping Syntax 2)