Mapping Data Sources

This page discusses mapping data sources in detail to connect to a virtual graph.

Page Contents

Overview
SMS2 (Stardog Mapping Syntax 2)

Overview

As discussed in the How Virtual Graphs Work section, a data mapping file is one of the three components that make up a Virtual Graph (the others being a unique name and properties file).

The mapping specifies how the source data maps to RDF data. With a SQL data source, this means we must specify which RDF properties and classes correspond with different fields in a row. Stardog’s native mapping syntax is called SMS2. SMS2 supports multiple source data models, including SQL, JSON, CSV, and GraphQL. The standard R2RML mappings are also supported but do not contain as many features as SMS2. The following table lists which mapping formats are supported for each data source.

Relational Databases

Datasource	R2RML	SMS2	Source Data Model	Functions in Bind Expressions
*	Yes	Yes	SQL	Type casts

NoSQL Databases

Datasource	R2RML	SMS2	Source Data Model	Functions in Bind Expressions
Cassandra	Yes	Yes	SQL	Type casts
Cosmos DB	No	Yes	JSON/GraphQL	Type casts
MongoDB	No	Yes	JSON/GraphQL	Type casts
Elasticsearch	No	Yes	JSON/GraphQL	Type casts

Static Files

Datasource	R2RML	SMS2	Source Data Model	Functions in Bind Expressions
CSV	Yes	Yes	CSV	Full support
JSON	No	Yes	JSON/GraphQL	Full Support

Support for type cast functions in BIND expressions means only the template and datatype casting functions are available. Valid datatype casting functions include specific types, e.g., xsd:integer(?field), and the generic strdt() function, which can also accept a user-defined type. Full support makes all SPARQL functions available.

We recommend using SMS2 for supporting all features.

SMS2 (Stardog Mapping Syntax 2)

Stardog Mapping Syntax 2 (SMS2) is a way to represent virtual graph mappings that is designed to support a broader range of source data formats than R2RML, including semi-structured data sources such as JSON, MongoDB, and Elasticsearch, as well as structured formats like SQL RDBMS.

SMS2 is loosely based on the SPARQL CONSTRUCT query. An abbreviated example looks like this:

PREFIX : <http://stardog.com/movies/>

MAPPING <urn:movies>
FROM JSON {
  "movie":{
    "_id":"?movieId",
    "name":"?name",
  }
}
TO {
  ?movie a :Movie ;
    :name ?name .
}
WHERE {
  BIND (template("http://stardog.com/movies/Title_{movieId}") AS ?movie)
}

SMS2 consists of five parts: PROLOGUE, MAPPING, FROM, TO, and WHERE.

The PROLOGUE is a series of prefix declarations at the beginning of the file.
The MAPPING through WHERE clauses define a mapping, and the set of them can be repeated, separated by a semicolon.
1. The MAPPING clause consists of the MAPPING keyword, followed by an optional IRI for naming the mapping.
The FROM clause describes the source data. The FROM clause starts with the FROM keyword and is followed by a data format keyword. JSON in this example, but it can be CSV, GraphQL , or SQL. It is followed by a definition that describes the structure of the data and assigns fields to variable names.
The TO clause defines how the target RDF should look. It is analogous to the CONSTRUCT portion of the SPARQL CONSTRUCT query. It consists of a set of triple patterns. Edge properties are supported.
The WHERE clause is where you can transform source data and BIND the transformed values to new variables. The currently supported functions for use within BIND are template for IRI construction and the cast functions (xsd:string, xsd:boolean, xsd:integer, xsd:float, xsd:double, xsd:decimal, xsd:dateTime, xsd:date) for literal value type conversion and the strdt function for creation of typed literals from strings. Unlike all other Virtual Graph data sources, the WHERE clause in SMS2 mappings for text files supports any SPARQL query function when BIND-ing transformed values to new variables.

Notice there are no platform-specific query elements (such as MongoDB query syntax) present in the mapping, only descriptions of the source and target data schemas and transformations for mapping the relationship between the source and target.

To help illustrate SMS2, we’ll use the following JSON for a movie collection:

[{
	"_id": "unforgiven",
	"name": "Unforgiven",
	"datePublished": "1992-08-07T00:00:00.000Z",
	"genre": ["Drama", "Western"],
	"boxOffice": 101157447,
	"description": "Retired gunslinger reluctantly takes on one last job.",
	"director": [{
		"director": "clintEastwood",
		"name": "Clint Eastwood"
	}],
	"actor": [{
			"actor": "morganFreeman",
			"name": "Morgan Freeman"
		},
		{
			"actor": "clintEastwood",
			"name": "Clint Eastwood"
		},
		{
			"actor": "geneHackman",
			"name": "Gene Hackman"
		}
	]
}, {
	"_id": "noWayOut",
	"name": "No Way Out",
	"datePublished": "1987-08-14T00:00:00.000Z",
	"genre": ["Action", "Mystery", "Drama", "Thriller"],
	"boxOffice": 35509515,
	"description": "A coverup and witchhunt occur after a politician accidentally kills his mistress.",
	"director": [{
		"director": "rogerDonaldson",
		"name": "Roger Donaldson"
	}],
	"actor": [{
			"actor": "geneHackman",
			"name": "Gene Hackman"
		},
		{
			"actor": "kevinCostner",
			"name": "Kevin Costner"
		}
	]
}]

Specifics on importing JSON data into Stardog can be found on the page on importing JSON and CSV files.

FROM JSON also supports MongoDB as well as features specific to it and its Extended JSON syntax. Full details of this functionality can be found on the page about specific data source considerations.

For this example we’ll create mappings that represent the data as this RDF:

@prefix : <http://stardog.com/movies/> .

:Title_noWayOut a :Movie ;
    :name "No Way Out" ;
    :datePublished "1987-08-14"^^xsd:date ;
    :boxOffice 35509515 ;
    :description "A coverup and witchhunt occur after a politician accidentally kills his mistress." ;
    :genre "Action", "Mystery", "Drama", "Thriller" ;
    :directed :Job_noWayOut_rogerDonaldson ;
    :actedIn :Job_noWayOut_geneHackman, :Job_noWayOut_kevinCostner .

:Title_unforgiven a :Movie ;
    :name "Unforgiven" ;
    :datePublished "1992-08-07"^^xsd:date ;
    :boxOffice 101157447 ;
    :description "Retired gunslinger reluctantly takes on one last job." ;
    :genre "Drama", "Western" ;
    :directed :Job_unforgiven_clintEastwood ;
    :actedIn :Job_unforgiven_morganFreeman, :Job_unforgiven_clintEastwood, :Job_unforgiven_geneHackman .

:Job_noWayOut_rogerDonaldson a :DirectedMovie ;
    :name "Roger Donaldson" ;
    :director :Name_rogerDonaldson .

:Name_rogerDonaldson a :Person .

:Job_unforgiven_clintEastwood a :DirectedMovie ;
    :name "Clint Eastwood" ;
    :director :Name_clintEastwood .

:Job_unforgiven_clintEastwood a :ActedInMovie ;
    :name "Clint Eastwood" ;
    :actor :Name_clintEastwood .

:Name_clintEastwood a :Person .

:Job_noWayOut_geneHackman a :ActedInMovie ;
    :name "Gene Hackman" ;
    :actor :Name_geneHackman .

:Job_unforgiven_geneHackman a :ActedInMovie ;
    :name "Gene Hackman" ;
    :actor :Name_geneHackman .

:Name_geneHackman a :Person .

:Job_noWayOut_kevinCostner a :ActedInMovie ;
    :name "Kevin Costner" ;
    :actor :Name_kevinCostner .

:Name_kevinCostner a :Person .

:Job_unforgiven_morganFreeman a :ActedInMovie ;
    :name "Morgan Freeman" ;
    :actor :Name_morganFreeman .

:Name_morganFreeman a :Person .

Notice there are many IRIs that contain both Movie and Person ids. These scoped IRIs are redundant in this dataset, but they serve a purpose when working with denormalized datasources, which is common in NoSQL databases like MongoDB. In this dataset, the name of a Person can appear in both actor and director objects. The name is repeated for every directing or acting job that Person has had. There is no guarantee that a Person’s name is constant across all their jobs, either because the field reflects the name the person had at the time of the job, or because of a problem during an update that led to the inconsistency. Without IRIs that scope a Person to a specific Movie, when you query for the Person’s name, the correct response is a record for every Person/name pair, which can be an expensive query.

See the blog post Mapping Denormalized Data for more details.

Here is the SMS2 mapping for this exercise:

PREFIX : <http://stardog.com/movies/>
MAPPING <urn:movies>
FROM JSON {
  [{
    "_id":"?movieId",
    "name":"?name",
    "datePublished":"?datePublished",
    "genre":["?genre"],
    "boxOffice":"?boxOffice",
    "description":"?description",
    "director":[ {
      "director":"?directorId",
      "name":"?directorName"
    }
    ],
    "actor":[{
      "actor":"?actorId",
      "name":"?actorName"
    }]
  }]
}
TO {
  ?movie a :Movie ;
    :name ?name ;
    :datePublished ?xsdDatePublished ;
    :genre ?genre ;
    :boxOffice ?boxOfficeNum ;
    :description ?description ;
    :directed ?directedMovie ;
    :actedIn ?actedInMovie .

  ?directedMovie a :DirectedMovie ;
    :director ?director ;
    :name ?directorName .

  ?director a :Person .

  ?actedInMovie a :ActedInMovie ;
    :actor ?actor ;
    :name ?actorName .

  ?actor a :Person .
}
WHERE {
  BIND (template("http://stardog.com/movies/Job_{movieId}_{directorId}") AS ?directedMovie)
  BIND (template("http://stardog.com/movies/Job_{movieId}_{actorId}") AS ?actedInMovie)
  BIND (template("http://stardog.com/movies/Title_{movieId}") AS ?movie)
  BIND (template("http://stardog.com/movies/Name_{directorId}") AS ?director)
  BIND (template("http://stardog.com/movies/Name_{actorId}") AS ?actor)
  BIND (xsd:date(?datePublished) AS ?xsdDatePublished)
  BIND (xsd:integer(?boxOffice) as ?boxOfficeNum)
}

Details of the various FROM formats (SQL, JSON, GraphQL, CSV) are described below in their respective sections.

FROM SQL

The most common option for the FROM clause is FROM SQL, which is for RDBMS datasources and Cassandra. It differs from the JSON and GraphQL source template formats in that, for SQL, we provide a query in place of a data description. Stardog will interrogate the database schema to determine the field names (which will become variable names) to use for mapping.

We refer to the fields in the query result using variables. These variables can appear in triple patterns in the mapping target, as well as BIND expressions, to create IRI templates.

PREFIX emp: <http://example.com/emp/>
PREFIX dept: <http://example.com/dept/>

MAPPING <urn:departments>
FROM SQL {
  SELECT * FROM "DEPT"
}
TO {
  ?deptIri a dept:Department ;
    dept:location ?loc ;
    dept:deptno "?deptno"^^xsd:integer .
}
WHERE {
  BIND (template("http://example.com/dept/{deptno}") AS ?deptIri)
}

;

MAPPING <urn:employees>
FROM SQL {
  SELECT \"empno\", \"ename\", \"deptno\", (CASE \"job\"
    WHEN 'CLERK' THEN 'general-office'
    WHEN 'NIGHTGUARD' THEN 'security'
    WHEN 'ENGINEER' THEN 'engineering'
  END) AS ROLE FROM \"EMP\"
}
TO {
  ?empIri a emp:Employee ;
    emp:name ?ename ;
    emp:role ?roleIri ;
    emp:department ?deptIri .
}
WHERE {
  BIND (template("http://example.com/emp/{empno}") AS ?empIri)
  BIND (template("http://example.com/dept/{deptno}") AS ?deptIri)
  BIND (template("http://example.com/emp/{ROLE}") AS ?roleIri)
}

Note the use of the semicolon to separate multiple mappings, which were necessary because we needed the two SQL statements.

Databases and Schemas

By default, for JDBC data sources, Stardog will be able to see all the tables that belong to the default schema. Which schema is the default will depend on the JDBC driver. In most cases, if the database name is not specified on the JDBC connection string, the default schema will be determined by the user that creates the connection. You can specify a different default schema by setting the sql.default.schema data source configuration option.

To access tables that are outside the default schema, you must include those schema names in the sql.schemas data source configuration option. Setting sql.schemas to * will include all schemas in the default catalog (the level above schema or database in a three-level table-naming convention). For databases that support 3-level (catalog/schema/table) table identifiers, e.g., Databricks, setting sql.schemas to *.* will include all schemas in all the configured catalogs.

When mapping to a table that is outside the default schema, you must reference it using the qualified name (e.g., "schema_name"."table_name"). The schema name is optional for tables in the default schema. Using qualified names for all tables will make the mappings more resilient to changes in the default schema.

Schema Drift

When a data source is added to (or refreshed in) Stardog, the metadata (catalogs, schemas, tables, columns, data types, constraints, row-count estimates) are queried. This initial load can take some time in an environment with very many tables in the configured schemas. Stardog will cache this metadata in the system database so that this querying can be avoided when Stardog is restarted.

Because Stardog caches this metadata, it will not see new or altered tables as these changes are made. See Managing Metadata for details on refreshing this metadata.

FROM JSON

The structure of the FROM JSON clause resembles the source JSON structure with some changes:

Values are replaced by variable names.
Arrays contain a single element.
Only one JSON document is supplied. The contents with the FROM JSON { ... } must match the source data. A JSON “root” can be one of the following:
1. Single object:
```
 { 
   dataset-generated: 2021-03-01, 
   values : [ {x:1}, {x:2} ] 
 }
```
2. Sequence of objects:
```
 {x:1}
 {x:2}
```
3. Array:
```
 [
   {x:1},
   {x:2}
 ]
```
MongoDB and Cosmos only: There is an outermost key to indicate the name of the collection. See the MongoDB section on specific data source considerations.

Fields are interpreted as strings unless given a specific data type, by using a cast function in either the WHERE clause, as illustrated in the example with the datePublished field, or directly in the TO clause, as illustrated by the boxOffice field.

See Importing JSON Files for an example of importing a JSON file.

FROM GraphQL

The FROM GraphQL definition is an alternative format for hierarchical data. It is a Selection Set consisting of Fields, which can be aliased and can contain nested selection sets. By default, each field will be mapped to a variable with the same name as the field. If the field is aliased, the alias will serve as the variable name. To identify an array, use an @array directive.

The following mapping uses a FROM GraphQL clause to produce the same results as our prior example that used a FROM JSON clause.

A noteworthy difference between FROM GraphQL and FROM JSON is the order in which source names are replaced with target names. In FROM JSON, you can reference the value associated with each “_id” attribute by specifying "_id":"?movieid" – the variable is on the right. In FROM GraphQL, you do the same by specifying movieId: _id – the variable is on the left.

PREFIX : <http://stardog.com/movies/>
MAPPING <urn:movies>
FROM GraphQL {
  movie {
    movieId: _id
    name
    datePublished
    genre @array
    boxOffice
    description
    director @array {
        directorId: director
        directorName: name
    }
    actor @array {
        actorId: actor
        actorName: name
    }
  }
}
TO {
  ?movie a :Movie ;
    :name ?name ;
    :datePublished ?xsdDatePublished ;
    :genre ?genre ;
    :boxOffice "?boxOffice"^^xsd:integer ;
    :description ?description ;
    :directed ?directedMovie ;
    :actedIn ?actedInMovie .

  ?directedMovie a :DirectedMovie ;
    :director ?director ;
    :name ?directorName .

  ?director a :Person .

  ?actedInMovie a :ActedInMovie ;
    :actor ?actor ;
    :name ?actorName .

  ?actor a :Person .
}
WHERE {
  BIND (template("http://stardog.com/movies/Job_{movieId}_{directorId}") AS ?directedMovie)
  BIND (template("http://stardog.com/movies/Job_{movieId}_{actorId}") AS ?actedInMovie)
  BIND (template("http://stardog.com/movies/Title_{movieId}") AS ?movie)
  BIND (template("http://stardog.com/movies/Name_{directorId}") AS ?director)
  BIND (template("http://stardog.com/movies/Name_{actorId}") AS ?actor)
  BIND (xsd:date(?datePublished) AS ?xsdDatePublished)
}

Note how an array of primitives like genre has the @array directive, while an array of objects has the @array directive followed by a selection set. If we wished to map, say, the genre field to a genres variable, we would use an alias, giving this complete line for the genre field: genres: genre @array.

FROM CSV

FROM CSV is used when importing CSV or TSV delimited files. There is no content in the FROM CSV clause. Either a set of empty braces can follow the FROM CSV clause, or the braces can be omitted.

Edge Properties in SMS2

The SMS2 target specification allows triple patterns with statement annotations. The following example shows how to annotate a statement with a time value from the underlying data source.

MAPPING
FROM SQL {
   select * from employees
}
TO {
  ?emp a emp:Employee .

  <<?emp emp:firstName ?first_name>> emp:since "?birth_date"^^xsd:date .
}
WHERE {
  BIND(template("http://employee/{emp_no}") as ?emp)
}

Overview
SMS2 (Stardog Mapping Syntax 2)