Mapping Data Sources
This page discusses mapping data sources in detail to connect to a virtual graph.
Page Contents
Overview
As discussed in the How Virtual Graphs Work, a data mapping file is one of the three components that make up a Virtual Graph (the others being a unique name and properties file).
The mapping specifies how the source data maps to RDF data. With a SQL datasource, this means we must specify which RDF properties and classes correspond with different fields in a row. Stardog’s native mapping syntax is called SMS2. SMS2 supports multiple source data models including SQL, JSON, CSV and GraphQL. The standard R2RML mappings are also supported but do not contain as many features as SMS2. The following table lists which mapping formats are supported for each data source.
Relational Databases
Datasource | R2RML | SMS2 | Source Data Model | Functions in Bind Expressions |
---|---|---|---|---|
* | Yes | Yes | SQL | Type casts |
NoSQL Databases
Datasource | R2RML | SMS2 | Source Data Model | Functions in Bind Expressions |
---|---|---|---|---|
Cassandra | Yes | Yes | SQL | Type casts |
Cosmos DB | No | Yes | JSON/GraphQL | Type casts |
MongoDB | No | Yes | JSON/GraphQL | Type casts |
Elasticsearch | No | Yes | JSON/GraphQL | Type casts |
Static Files
Datasource | R2RML | SMS2 | Source Data Model | Functions in Bind Expressions |
---|---|---|---|---|
CSV | Yes | Yes | CSV | Full support |
JSON | No | Yes | JSON/GraphQL | Full Support |
Support for type cast functions in BIND expressions means only the template
and datatype casting functions are available. Valid datatype casting functions include specific types, eg. xsd:integer(?field)
, and the generic strdt()
function which can also accept a user-defined type. Full support makes all SPARQL functions available.
We recommend using SMS2 for supporting all features.
SMS2 (Stardog Mapping Syntax 2)
Stardog Mapping Syntax 2 (SMS2) is a way to represent virtual graph mappings that is designed to support a broader range of source data formats than R2RML, including semi-structured data sources such as JSON, MongoDB and Elasticsearch, as well as structured formats like SQL RDBMS.
SMS2 is loosely based on the SPARQL CONSTRUCT
query. An abbreviated example looks like this:
PREFIX : <http://stardog.com/movies/>
MAPPING <urn:movies>
FROM JSON {
"movie":{
"_id":"?movieId",
"name":"?name",
}
}
TO {
?movie a :Movie ;
:name ?name .
}
WHERE {
BIND (template("http://stardog.com/movies/Title_{movieId}") AS ?movie)
}
SMS2 consists of five parts: PROLOGUE
, MAPPING
, FROM
, TO
, and WHERE
.
- The
PROLOGUE
is a series of prefix declarations at the beginning of the file. - The
MAPPING
throughWHERE
clauses define a mapping and the set of them can be repeated, separated by a semicolon.- The
MAPPING
clause consists of theMAPPING
keyword followed by an optional IRI for naming the mapping.
- The
- The
FROM
clause describes the source data. TheFROM
clause starts with theFROM
keyword and is followed by a data format keyword.JSON
in this example, but it can beCSV
,GraphQL
, orSQL
. It is followed by a definition that describes the structure of the data and assigns fields to variable names. - The
TO
clause defines how the target RDF should look. It is analogous to theCONSTRUCT
portion of the SPARQLCONSTRUCT
query. It consists of a set of triple patterns. Edge properties are supported. - The
WHERE
clause is where you can transform source data andBIND
the transformed values to new variables. The currently supported functions for use withinBIND
aretemplate
for IRI construction and the cast functions (xsd:string
,xsd:boolean
,xsd:integer
,xsd:float
,xsd:double
,xsd:decimal
,xsd:dateTime
,xsd:date
) for literal value type conversion and thestrdt
function for creation of typed literals from strings. Unlike all other Virtual Graph data sources, theWHERE
clause in SMS2 mappings for text files supports any SPARQL query function whenBIND
-ing transformed values to new variables.
Notice there are no platform-specific query elements (such as MongoDB query syntax) present in the mapping, only descriptions of the source and target data schemas and transformations for mapping the relationship between the source and target.
To help illustrate SMS2, we’ll use the following JSON for a movie
collection:
[{
"_id": "unforgiven",
"name": "Unforgiven",
"datePublished": "1992-08-07T00:00:00.000Z",
"genre": ["Drama", "Western"],
"boxOffice": 101157447,
"description": "Retired gunslinger reluctantly takes on one last job.",
"director": [{
"director": "clintEastwood",
"name": "Clint Eastwood"
}],
"actor": [{
"actor": "morganFreeman",
"name": "Morgan Freeman"
},
{
"actor": "clintEastwood",
"name": "Clint Eastwood"
},
{
"actor": "geneHackman",
"name": "Gene Hackman"
}
]
}, {
"_id": "noWayOut",
"name": "No Way Out",
"datePublished": "1987-08-14T00:00:00.000Z",
"genre": ["Action", "Mystery", "Drama", "Thriller"],
"boxOffice": 35509515,
"description": "A coverup and witchhunt occur after a politician accidentally kills his mistress.",
"director": [{
"director": "rogerDonaldson",
"name": "Roger Donaldson"
}],
"actor": [{
"actor": "geneHackman",
"name": "Gene Hackman"
},
{
"actor": "kevinCostner",
"name": "Kevin Costner"
}
]
}]
Specifics on importing JSON data into Stardog can be found on the page on importing JSON and CSV files.
FROM JSON
also supports MongoDB as well as features specific to it and its Extended JSON syntax. Full details of this functionality can be found on the page about specific data source considerations.
For this example we’ll create mappings that represent the data as this RDF:
@prefix : <http://stardog.com/movies/> .
:Title_noWayOut a :Movie ;
:name "No Way Out" ;
:datePublished "1987-08-14"^^xsd:date ;
:boxOffice 35509515 ;
:description "A coverup and witchhunt occur after a politician accidentally kills his mistress." ;
:genre "Action", "Mystery", "Drama", "Thriller" ;
:directed :Job_noWayOut_rogerDonaldson ;
:actedIn :Job_noWayOut_geneHackman, :Job_noWayOut_kevinCostner .
:Title_unforgiven a :Movie ;
:name "Unforgiven" ;
:datePublished "1992-08-07"^^xsd:date ;
:boxOffice 101157447 ;
:description "Retired gunslinger reluctantly takes on one last job." ;
:genre "Drama", "Western" ;
:directed :Job_unforgiven_clintEastwood ;
:actedIn :Job_unforgiven_morganFreeman, :Job_unforgiven_clintEastwood, :Job_unforgiven_geneHackman .
:Job_noWayOut_rogerDonaldson a :DirectedMovie ;
:name "Roger Donaldson" ;
:director :Name_rogerDonaldson .
:Name_rogerDonaldson a :Person .
:Job_unforgiven_clintEastwood a :DirectedMovie ;
:name "Clint Eastwood" ;
:director :Name_clintEastwood .
:Job_unforgiven_clintEastwood a :ActedInMovie ;
:name "Clint Eastwood" ;
:actor :Name_clintEastwood .
:Name_clintEastwood a :Person .
:Job_noWayOut_geneHackman a :ActedInMovie ;
:name "Gene Hackman" ;
:actor :Name_geneHackman .
:Job_unforgiven_geneHackman a :ActedInMovie ;
:name "Gene Hackman" ;
:actor :Name_geneHackman .
:Name_geneHackman a :Person .
:Job_noWayOut_kevinCostner a :ActedInMovie ;
:name "Kevin Costner" ;
:actor :Name_kevinCostner .
:Name_kevinCostner a :Person .
:Job_unforgiven_morganFreeman a :ActedInMovie ;
:name "Morgan Freeman" ;
:actor :Name_morganFreeman .
:Name_morganFreeman a :Person .
Notice there are many IRIs that contain both Movie and Person ids. These scoped IRIs are redundant in this dataset but they serve a purpose when working with denormalized datasources, which is common in NoSQL databases like MongoDB. In this dataset, the name of a Person can appear in both actor and director objects. The name is repeated for every directing or acting job that Person has had. There is no guarantee that a Person’s name is constant across all their jobs, either because the field reflects the name the person had at the time of the job, or because of a problem during an update that led to the inconsistency. Without IRIs that scope a Person to a specific Movie, when you query for the Person’s name, the correct response is a record for every Person/name pair, which can be an expensive query.
See the blog post Mapping Denormalized Data for more details.
Here is the SMS2 mapping for this exercise:
PREFIX : <http://stardog.com/movies/>
MAPPING <urn:movies>
FROM JSON {
[{
"_id":"?movieId",
"name":"?name",
"datePublished":"?datePublished",
"genre":["?genre"],
"boxOffice":"?boxOffice",
"description":"?description",
"director":[ {
"director":"?directorId",
"name":"?directorName"
}
],
"actor":[{
"actor":"?actorId",
"name":"?actorName"
}]
}]
}
TO {
?movie a :Movie ;
:name ?name ;
:datePublished ?xsdDatePublished ;
:genre ?genre ;
:boxOffice ?boxOfficeNum ;
:description ?description ;
:directed ?directedMovie ;
:actedIn ?actedInMovie .
?directedMovie a :DirectedMovie ;
:director ?director ;
:name ?directorName .
?director a :Person .
?actedInMovie a :ActedInMovie ;
:actor ?actor ;
:name ?actorName .
?actor a :Person .
}
WHERE {
BIND (template("http://stardog.com/movies/Job_{movieId}_{directorId}") AS ?directedMovie)
BIND (template("http://stardog.com/movies/Job_{movieId}_{actorId}") AS ?actedInMovie)
BIND (template("http://stardog.com/movies/Title_{movieId}") AS ?movie)
BIND (template("http://stardog.com/movies/Name_{directorId}") AS ?director)
BIND (template("http://stardog.com/movies/Name_{actorId}") AS ?actor)
BIND (xsd:date(?datePublished) AS ?xsdDatePublished)
BIND (xsd:integer(?boxOffice) as ?boxOfficeNum)
}
Details of the various FROM
formats (SQL
, JSON
, GraphQL
, CSV
) are described below in their respective sections.
FROM SQL
The most common option for the FROM
clause is FROM SQL
, which is for RDBMS datasources and Cassandra. It differs from the JSON
and GraphQL
source template formats in that for SQL
we provide a query in place of a data description. Stardog will interrogate the database schema to determine the field names (which will become variable names) to use for mapping.
We refer to the fields in the query result using variables. These variables can appear in triple patterns in the mapping target as well as BIND expressions to create IRI templates.
PREFIX emp: <http://example.com/emp/>
PREFIX dept: <http://example.com/dept/>
MAPPING <urn:departments>
FROM SQL {
SELECT * FROM "DEPT"
}
TO {
?deptIri a dept:Department ;
dept:location ?loc ;
dept:deptno "?deptno"^^xsd:integer .
}
WHERE {
BIND (template("http://example.com/dept/{deptno}") AS ?deptIri)
}
;
MAPPING <urn:employees>
FROM SQL {
SELECT \"empno\", \"ename\", \"deptno\", (CASE \"job\"
WHEN 'CLERK' THEN 'general-office'
WHEN 'NIGHTGUARD' THEN 'security'
WHEN 'ENGINEER' THEN 'engineering'
END) AS ROLE FROM \"EMP\"
}
TO {
?empIri a emp:Employee ;
emp:name ?ename ;
emp:role ?roleIri ;
emp:department ?deptIri .
}
WHERE {
BIND (template("http://example.com/emp/{empno}") AS ?empIri)
BIND (template("http://example.com/dept/{deptno}") AS ?deptIri)
BIND (template("http://example.com/emp/{ROLE}") AS ?roleIri)
}
Note the use of the semicolon to separate multiple mappings, which were necessary because we needed the two SQL statements.
Databases and Schemas
By default, for JDBC data sources, Stardog will be able to see all the tables that belong to the default schema. Which schema is the default will depend on the JDBC driver. In most cases, if the database name is not specified on the JDBC connection string, the default schema will be determined by the user that creates the connection. You can specify a different default schema by setting the sql.default.schema
data source configuration option.
To access tables that are outside the default schema, you must include those schema names in the sql.schemas
data source configuration option. Setting sql.schemas
to *
will include all schemas in the default catalog (the level above schema or database in a three-level table-naming convention). For databases that support 3-level (catalog/schema/table) table identifiers, e.g., Databricks, setting sql.schemas
to *.*
will include all schemas in all the configured catalogs.
When mapping to a table that is outside the default schema you must reference it using the qualified name (eg. "schema_name"."table_name"
). The schema name is optional for tables in the default schema. Using qualified names for all tables will make the mappings more resilient to changes in the default schema.
Schema Drift
When a data source is added to (or refreshed in) Stardog, the metadata (catalogs, schemas, tables, columns, data types, constraints, row-count estimates) are queried. This initial load can take some time in an environment with very many tables in the configured schemas. Stardog will cache this metadata in the system database so that this querying can be avoided when Stardog is restarted.
Because Stardog caches this metadata, it will not see new or altered tables as these changes are made. See Managing Metadata for details on refreshing this metadata.
FROM JSON
The structure of the FROM JSON
clause resembles the source JSON structure with some changes:
- Values are replaced by variable names.
- Arrays contain a single element.
- Only one JSON document is supplied. The contents with the
FROM JSON { ... }
must match the source data. A JSON “root” can be one of the following:-
Single object:
{ dataset-generated: 2021-03-01, values : [ {x:1}, {x:2} ] }
-
Sequence of objects:
{x:1} {x:2}
-
Array:
[ {x:1}, {x:2} ]
-
- MongoDB and Cosmos only: There is an outermost key to indicate the name of the collection. See the MongoDB section on specific data source considerations.
Fields are interpreted as strings unless given a specific data type, by using a cast function in either the WHERE
clause as illustrated in the example with the datePublished
field, or directly in the TO
clause as illustrated by the boxOffice
field.
See Importing JSON Files for an example of importing a JSON file.
FROM GraphQL
The FROM GraphQL
definition is an alternative format for hierarchical data. It is a Selection Set consisting of Fields, which can be aliased and can contain nested selection sets. By default, each field will be mapped to a variable with the same name as the field. If the field is aliased the alias will serve as the variable name. To identify an array, use an @array
directive.
The following mapping uses a FROM GraphQL
clause to produce the same results as our prior example that used a FROM JSON
clause.
A noteworthy difference between FROM GraphQL
and FROM JSON
is the order in which source names are replaced with target names. In FROM JSON
you can reference the value associated with each “_id” attribute by specifying "_id":"?movieid"
, the variable is on the right. In FROM GraphQL
you do the same by specifying movieId: _id
, the variable is on the left.
PREFIX : <http://stardog.com/movies/>
MAPPING <urn:movies>
FROM GraphQL {
movie {
movieId: _id
name
datePublished
genre @array
boxOffice
description
director @array {
directorId: director
directorName: name
}
actor @array {
actorId: actor
actorName: name
}
}
}
TO {
?movie a :Movie ;
:name ?name ;
:datePublished ?xsdDatePublished ;
:genre ?genre ;
:boxOffice "?boxOffice"^^xsd:integer ;
:description ?description ;
:directed ?directedMovie ;
:actedIn ?actedInMovie .
?directedMovie a :DirectedMovie ;
:director ?director ;
:name ?directorName .
?director a :Person .
?actedInMovie a :ActedInMovie ;
:actor ?actor ;
:name ?actorName .
?actor a :Person .
}
WHERE {
BIND (template("http://stardog.com/movies/Job_{movieId}_{directorId}") AS ?directedMovie)
BIND (template("http://stardog.com/movies/Job_{movieId}_{actorId}") AS ?actedInMovie)
BIND (template("http://stardog.com/movies/Title_{movieId}") AS ?movie)
BIND (template("http://stardog.com/movies/Name_{directorId}") AS ?director)
BIND (template("http://stardog.com/movies/Name_{actorId}") AS ?actor)
BIND (xsd:date(?datePublished) AS ?xsdDatePublished)
}
Note how an array of primitives like genre
has the @array
directive while an array of objects has the @array
directive followed by a selection set. If we wished to map, say, the genre
field to a genres
variable, we would use an alias, giving this complete line for the genre field: genres: genre @array
.
FROM CSV
FROM CSV
is used when importing CSV or TSV delimited files;. There is no content in the FROM CSV
clause. Either a set of empty braces can follow the FROM CSV
clause, or the braces can be omitted.
Edge Properties in SMS2
The SMS2 target specification allows triple patterns with statement annotations. The following example shows how to annotate a statement with a time value from the underlying data source.
MAPPING
FROM SQL {
select * from employees
}
TO {
?emp a emp:Employee .
<<?emp emp:firstName ?first_name>> emp:since "?birth_date"^^xsd:date .
}
WHERE {
BIND(template("http://employee/{emp_no}") as ?emp)
}