Virtual Graphs
This chapter discusses Virtual Graphs - one of Stardog’s primary features for unifying enterprise data. This page mostly discusses the basics of Virtual Graphs. See the Chapter Contents for a short description of what else is included in this chapter. Virtual Graph security is not included in this chapter but is included in the Security chapter.
Overview
Stardog supports a set of techniques for unifying structured enterprise data, chiefly Virtual Graphs, which let you declaratively map data into a Stardog knowledge graph and query it in situ via Stardog.
How it Works
Stardog intelligently rewrites (parts of) SPARQL queries against Stardog into native query syntaxes like SQL, issues the native queries to remote data sources, and then translates the native results into SPARQL results. Virtual Graphs can map two kinds of data to RDF: tabular (relational) data from RDBMSs and CSVs, and semi-structured hierarchical data from NoSQL sources such as MongoDB, Elasticsearch, Cassandra and JSON.
A Virtual Graph has four components:
- a unique name
- a data source
- a properties file specifying configuration options
- a data mapping file (which can be omitted and automatically generated for most sources)
Connecting to a Virtual Graph
To query a non-materialized Virtual Graph, it must first be registered with Stardog. A new Virtual Graph is added with the virtual add CLI command:
$ stardog-admin virtual add dept.properties dept.ttl
When adding a Virtual Graph, Stardog will create a Data Source and establish a connection through it to verify the provided configuration and mappings.
Virtual Graph Properties File
The properties file (dept.properties in this example) contains all of the configuration for the JDBC data source and the virtual graph. It must be in the Java properties file format.
A minimal example (in this case, for MySQL) looks like this:
jdbc.url=jdbc:mysql://localhost/dept
jdbc.username=MySqlUserName
jdbc.password=MyPassword
jdbc.driver=com.mysql.jdbc.Driver
The virtual graph takes its name from the name of the configuration file without the extension, so the virtual graph in this example is named dept. You can override this name using the --name option.
Stardog does not ship with client drivers. You must add drivers for each data source you want to connect to. See Supported Client Drivers for more information.
By default, the credentials for the JDBC connection are provided in plain text in the properties file. An alternative is the password file mechanism: store the credentials in a password file called services.sdpass located in the STARDOG_HOME directory. Password file entries have the format hostname:port:database:username:password. With the following entry in this file, the credentials for the above example can be omitted from the properties file:
localhost:*:dept:MySqlUserName:MyPassword
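As a minimal sketch of the entry format described above, the following Python helper joins the five fields with colons. The function name sdpass_entry is hypothetical, the * default for the port simply mirrors the example entry, and the sketch makes no attempt to escape colons inside field values, since the escaping rules are not covered here.

```python
def sdpass_entry(hostname, database, username, password, port="*"):
    """Format one password file line as hostname:port:database:username:password.

    Hypothetical helper for illustration; "*" is used for the port because
    the documented example entry uses it in that position.
    """
    fields = [hostname, str(port), database, username, password]
    return ":".join(fields)


# Reproduces the entry shown for the MySQL example.
entry = sdpass_entry("localhost", "dept", "MySqlUserName", "MyPassword")
print(entry)
```

The resulting line would be appended to services.sdpass in the STARDOG_HOME directory.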
The properties file can also contain an option called base to specify a base URI for resolving relative URIs generated by the mappings (if any). If no value is provided, the base URI will be virtual://myGraph where myGraph is the name of the virtual graph.
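The two defaults described in this section (the graph name derived from the properties file name, and the fallback base URI) can be sketched in Python as follows. Both function names are hypothetical, and the sketch assumes that "without the extension" means dropping only the final suffix of the file name.

```python
from pathlib import Path


def virtual_graph_name(properties_path, name_override=None):
    """Return the virtual graph name: the --name value if one is given,
    otherwise the properties file name without its extension."""
    if name_override:
        return name_override
    # Assumption: only the last suffix (e.g. ".properties") is dropped.
    return Path(properties_path).stem


def base_uri(properties, graph_name):
    """Return the configured base option, or virtual://<name> if it is absent."""
    return properties.get("base", f"virtual://{graph_name}")


name = virtual_graph_name("dept.properties")
print(name)
print(base_uri({}, name))
print(base_uri({"base": "http://example.com/"}, name))
```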
There are many more available configuration options for Virtual Graphs. They are described in the Virtual Graph Configuration section.
Creating a Shared Data Source
In the last example we created the Virtual Graph without supplying the name of an existing Data Source. In that case Stardog automatically creates a “private” Data Source with the same name as the Virtual Graph (in this case <data-source://dept>) for the dedicated use of this Virtual Graph. To use a “shared” Data Source, we first create the Data Source with the data-source add CLI command:
$ stardog-admin data-source add dept.properties
Once this Data Source is created we can use it to create a Virtual Graph:
$ stardog-admin virtual add --name dept --data-source dept dept.ttl
All the options in this example applied to the Data Source, so the properties file was not needed for the virtual add command. Omitting the properties file, however, necessitated the inclusion of the --name option.
See Data Sources for more information.
Mapping File
The mapping file (dept.ttl in this example) contains the mapping from the virtual data source into RDF. The mapping can be in one of three formats:
- SMS, which is the default for the virtual add CLI command
- Standard R2RML, which is indicated using --format r2rml in the virtual add CLI command
- SMS2 (Stardog Mapping Syntax 2), a syntax that better supports hierarchical data sources like JSON and MongoDB. This is indicated using --format sms2 in the virtual add CLI command
A mapping file is required for data sources without a built-in schema, e.g. some NoSQL databases like MongoDB.
A mapping file is not required if your data has a built-in schema, e.g. MySQL or other relational databases. In this case you can omit the mapping file and the virtual graph will be automatically mapped using R2RML direct mapping. Omitting a mapping file is most commonly combined with one or both of the virtual graph options default.mapping.include.tables and sql.schemas to indicate the specific tables to include.
See the detailed documentation about how to create mappings for your data source.
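To tie the registration variants together, here is a small Python sketch that assembles the virtual add command line from the options mentioned so far (--name, --data-source and --format). The helper name and its parameters are hypothetical, the option order simply follows the examples above, and the command is only built, not executed.

```python
import shlex


def virtual_add_command(properties_file=None, mapping_file=None,
                        name=None, data_source=None, mapping_format=None):
    """Build the argument list for `stardog-admin virtual add` (illustrative)."""
    if properties_file is None and name is None:
        # Without a properties file there is no file name to derive the name from.
        raise ValueError("--name is required when the properties file is omitted")
    if mapping_format not in (None, "r2rml", "sms2"):
        raise ValueError("mapping_format must be 'r2rml', 'sms2' or None (SMS default)")

    cmd = ["stardog-admin", "virtual", "add"]
    if name:
        cmd += ["--name", name]
    if data_source:
        cmd += ["--data-source", data_source]
    if mapping_format:
        cmd += ["--format", mapping_format]
    if properties_file:
        cmd.append(properties_file)
    if mapping_file:
        cmd.append(mapping_file)
    return cmd


private = virtual_add_command(properties_file="dept.properties", mapping_file="dept.ttl")
shared = virtual_add_command(name="dept", data_source="dept", mapping_file="dept.ttl")
print(shlex.join(private))
print(shlex.join(shared))
```

The resulting list could be passed to subprocess.run on a machine where stardog-admin is installed.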
Querying Virtual Graphs
Virtual Graphs are queried with the GRAPH clause, using a special graph URI of the form virtual://myGraph to query the Virtual Graph named myGraph.
The following example shows how to query dept:
SELECT * {
  GRAPH <virtual://dept> {
    ?person a emp:Employee ;
      emp:name "SMITH"
  }
}
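For client code that assembles such queries, the following Python sketch builds the virtual:// URI from a graph name and wraps a graph pattern in the corresponding GRAPH clause. The function names are hypothetical, and the emp: prefix is assumed to be declared elsewhere (for example in the database namespaces), as in the query above.

```python
import textwrap


def virtual_graph_iri(name):
    """Return the special graph URI for the virtual graph called `name`."""
    return f"virtual://{name}"


def select_from_virtual_graph(name, pattern):
    """Wrap a SPARQL graph pattern in a GRAPH clause for the given virtual graph."""
    body = textwrap.indent(pattern.strip(), "    ")
    return (
        "SELECT * {\n"
        f"  GRAPH <{virtual_graph_iri(name)}> {{\n"
        f"{body}\n"
        "  }\n"
        "}"
    )


query = select_from_virtual_graph(
    "dept", '?person a emp:Employee ;\n  emp:name "SMITH"'
)
print(query)
```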
Virtual graphs can be defined globally in Stardog Server, which is the default, or they can be linked to a specific database when they are created. If a virtual graph is linked to a specific database, it can only be accessed from that database. Attempts to access a linked virtual graph from some other database will result in no data being returned from that virtual graph.
Once a virtual graph is registered, it can be accessed as allowed by access rules.
We can query the local Stardog database and a virtual graph’s remote data in a single query. Suppose we have the dept virtual graph, defined as above, that contains employee and department information, and the Stardog database contains data about the interests of people. We can use the following query to combine the information from both sources:
SELECT * {
  GRAPH <virtual://dept> {
    ?person a emp:Employee ;
      emp:name "SMITH" .
  }
  ?person foaf:interest ?interest
}
Or, with Virtual Transparency enabled, the following query will include remote data from the virtual graph as well as from the default graph.
SELECT * {
  ?person a emp:Employee ;
    emp:name "SMITH" .
  ?person foaf:interest ?interest
}
Query performance will be best if the GRAPH clause for Virtual Graphs is as selective as possible.
Virtual Graph queries are implemented by executing a query against the remote data source. This is a powerful feature and care must be taken to ensure peak performance. SPARQL and SQL don’t have feature parity, especially given the varying capabilities of SQL implementations. Stardog’s query translator supports most of the salient features of SPARQL including:
- Arbitrarily nested subqueries (including solution modifiers)
- Aggregation
- FILTER (including most SPARQL functions)
- OPTIONAL, UNION, BIND
That said, there are also limitations on translated queries. These include:
- Duplicate solutions can be returned
- SPARQL MINUS is not currently translated to SQL
- Comparisons between objects with different datatypes don’t always follow XML Schema semantics
- Named graphs in R2RML are not supported
Importing data from a Virtual Graph
In some cases you need to materialize the information stored in an RDBMS directly into RDF. For example, a combination of high network latency, slow-changing data, and strict query performance requirements can make materialization a good fit.
The CLI command virtual import can be used to import the contents of the RDBMS into Stardog. The command can be used as follows:
$ stardog-admin virtual import myDb dept.properties dept.ttl
This command adds all the mapped triples from the RDBMS into the default graph. Similar to virtual add, this command assumes SMS by default and can accept R2RML mappings using the --format r2rml option or SMS2 mappings using the --format sms2 option.
It is also possible to specify a target named graph by using the -g/--namedGraph option:
$ stardog-admin virtual import -g http://example.com/targetGraph myDb dept.properties dept.ttl
This virtual import command is equivalent to the following SPARQL update query:
ADD <virtual://dept> TO <http://example.com/targetGraph>
If the RDBMS contents change over time and we need to update the materialization results, we can clear the named graph contents and rematerialize. This can be done by using the --remove-all option in virtual import or with the following SPARQL update query:
COPY <virtual://dept> TO <http://example.com/targetGraph>
Query performance over materialized graphs will be better as the data will be indexed locally by Stardog, but materialization may not be practical in cases where frequency of change is very high.
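The difference between the two update operations above can be illustrated with a toy in-memory model in Python, where each graph is a set of triples. This is only a sketch of the SPARQL 1.1 Update semantics of ADD and COPY, not of how Stardog executes them, and the triples and function names are made up for the example.

```python
def sparql_add(graphs, source, target):
    """ADD: insert the source triples into the target, keeping what is already there."""
    graphs.setdefault(target, set()).update(graphs.get(source, set()))


def sparql_copy(graphs, source, target):
    """COPY: replace the target's contents with a copy of the source triples."""
    graphs[target] = set(graphs.get(source, set()))


source = "virtual://dept"
target = "http://example.com/targetGraph"
current = {("emp:7369", "emp:name", "SMITH")}
stale = {("emp:0001", "emp:name", "GONE")}  # row since deleted in the RDBMS

added = {source: set(current), target: set(stale)}
sparql_add(added, source, target)    # stale triple survives the ADD

copied = {source: set(current), target: set(stale)}
sparql_copy(copied, source, target)  # stale triple is removed by the COPY

print(len(added[target]), len(copied[target]))
```

This is why rematerializing with COPY (or --remove-all) suits data whose rows may be deleted at the source, while ADD only ever grows the target graph.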
Permissions
A user requires WRITE permission on a database in order to import data into it. If Named Graph Security is enabled, they will also require WRITE permission on the named graph into which they want to import data. If they are using COPY or ADD to import data, they will also need READ permission on the source virtual graph.
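The rules above can be summarized as a small checklist function. This Python sketch is illustrative only: the function name and the (permission, resource type, resource name) tuples are hypothetical labels rather than Stardog's security API, and it assumes the named graph rule applies only when a target named graph is given.

```python
def required_permissions(database, target_graph=None,
                         named_graph_security=False, source_virtual_graph=None):
    """List the permissions an import needs, following the rules in the text.

    Pass source_virtual_graph only when importing with COPY or ADD.
    """
    needed = [("WRITE", "database", database)]
    if named_graph_security and target_graph:
        needed.append(("WRITE", "named-graph", target_graph))
    if source_virtual_graph:
        needed.append(("READ", "virtual-graph", source_virtual_graph))
    return needed


perms = required_permissions(
    "myDb",
    target_graph="http://example.com/targetGraph",
    named_graph_security=True,
    source_virtual_graph="dept",
)
for permission, resource_type, resource in perms:
    print(permission, resource_type, resource)
```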
List Registered Virtual Graphs
Registered virtual graphs can be listed using the virtual list CLI command:
$ stardog-admin virtual list
+----------------|----------|--------+
| Virtual Graphs | Database | Online |
+----------------|----------|--------+
| virtual://dept | *        | true   |
+----------------|----------|--------+
1 virtual graphs
Notice the * in the Database column of the output of the virtual list command. This indicates that the dept virtual graph can be used with any database. To associate a virtual graph with a specific database, use the -d <db> or --database <db> command-line option with the virtual add command.
If a virtual graph fails to load during startup it will be listed as offline (Online = false). Use the virtual online command to retry loading an offline virtual graph.
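When monitoring a server from a script, the table printed by virtual list can be scanned for offline graphs. The Python sketch below assumes the output layout matches the example above (three pipe-separated columns); the parser name is hypothetical and the second sample row is invented to show an offline graph.

```python
def parse_virtual_list(output):
    """Parse the table printed by `stardog-admin virtual list` into dicts."""
    rows = []
    for line in output.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip border lines and the trailing count
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 3 or cells[0] == "Virtual Graphs":
            continue  # skip the header row and anything unexpected
        rows.append({
            "graph": cells[0],
            "database": cells[1],
            "online": cells[2] == "true",
        })
    return rows


sample = """\
+-----------------|----------|--------+
| Virtual Graphs  | Database | Online |
+-----------------|----------|--------+
| virtual://dept  | *        | true   |
| virtual://sales | myDb     | false  |
+-----------------|----------|--------+
2 virtual graphs
"""

graphs = parse_virtual_list(sample)
offline = [row["graph"] for row in graphs if not row["online"]]
print(offline)
```

Each graph reported as offline is a candidate for the virtual online command.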
Inspect a Virtual Graph’s Mappings
The CLI command virtual mappings can be used to retrieve the mappings associated with a virtual graph. Here’s an example that prints the mappings of a registered virtual graph in Stardog Mapping Syntax 2:
$ stardog-admin virtual mappings --format sms2 myGraph
Inspect a Virtual Graph’s Properties
The CLI command virtual options can be used to retrieve the properties associated with a virtual graph:
$ stardog-admin virtual options myGraph
Remove a Virtual Graph
Registered virtual graphs can be removed using the virtual remove command:
$ stardog-admin virtual remove myGraph
Chapter Contents
- Virtual Graph Configuration - discusses configuring virtual graphs
- Data Sources - data source management
- Mapping Data Sources - how to create virtual graph mappings
- Virtual Transparency - discusses Virtual Transparency, a virtual graph facility to query all virtual graphs over the default graph or set of named graphs
- Importing JSON and CSV Files - discusses how to import JSON and CSV files into Stardog
- Optimization - tips for optimizing virtual graphs
- Troubleshooting - tips for troubleshooting virtual graphs