Using Virtual Graphs
Learn how to create and query virtual graphs over relational data in Stardog.
Page Contents
Introduction
In the Virtual Graph Mappings tutorial we learned how to write mappings that map relational data to the Knowledge Graph. In this tutorial we will learn how to create a virtual graph in Stardog using these mappings and how to query the virtual graph using SPARQL as if the data was stored as an RDF graph in Stardog.
Prerequisites
If you would like to follow the steps in this tutorial, there are some initial setup steps to perform as explained in the following sections.
RDBMS Setup
In order to query a virtual graph over relational data we first need to have some relational data. We provide the SQL files necessary to create the tables introduced in the mappings tutorials and insert some sample data into those tables.
The file music_schema.sql creates empty tables and should be executed first. The file beatles.sql populates these tables with a few rows about The Beatles and should be executed next.
Note that these generic SQL files might need to be modified to match the SQL dialect of the RDBMS you are using.
JDBC Setup
Stardog does not ship with JDBC drivers so you need to manually copy the JAR file containing the driver for your RDBMS into the classpath of the Stardog server. The preferred way of doing this is by setting the STARDOG_EXT
environment variable to point to a directory outside the Stardog installation directory and copy the jar file there:
$ export STARDOG_EXT=/opt/stardog/ext
$ mkdir $STARDOG_EXT
$ cp JDBC_JAR_FILE $STARDOG_EXT
Note that the environment variable needs to be set before the Stardog server is started. If the variable was already set before the server is started then the JAR files can be copied to that location while the server is running without a restart.
Database Setup
In order to query a virtual graph (or any graph for that matter), we need to have a database in Stardog. In this tutorial we will start with an empty database since the data will be stored externally in the RDBMS. We can create an empty database in the usual way:
stardog-admin db create -n music
We will set the default namespace for the database upfront so that we don’t have to worry about prefix declarations later on:
stardog namespace add --prefix "" --uri "http://stardog.com/tutorial/" music
Virtual Graph Configuration
In order to create a virtual graph we need a configuration file that will provide the JDBC URL to access the RDBMS along with the required credentials. This file should be in the Java properties file format. An example configuration file for a MySQL database running on the same machine as the Stardog server would look like this:
jdbc.url=jdbc:mysql://localhost/music
jdbc.username=admin
jdbc.password=admin
jdbc.driver=com.mysql.jdbc.Driver
An alternative way to provide credentials is to use the password file mechanism as explained in the Stardog docs. If the password file is used, then credentials can be omitted from the configuration file. There are also additional properties that can be set in the configuration file.
Creating a Virtual Graph
Creating a virtual graph in Stardog is done with the virtual add
command by specifying the configuration file and the mappings file as input parameters:
stardog-admin virtual add music.properties music_mappings.sms
The name of the configuration file is used as the virtual graph name so this command will create a virtual graph named music
. We do not specify the database name in this command because the virtual graphs are defined globally in the Stardog server. Once they are registered with the server, they can be accessed via any Stardog database as allowed by the access rules.
Querying a Virtual Graph
Virtual graphs can be queried using the GRAPH
clause in SPARQL similar to regular named graphs. The IRI of the virtual graph uses a special IRI scheme virtual://
so the IRI for the music virtual graph we created is simply virtual://music
.
The query to retrieve the members of The Beatles from the virtual graph looks like this:
SELECT ?member ?name {
GRAPH <virtual://music> {
?band a :Band ;
:name "The Beatles" ;
:member ?member .
?member :name ?name .
}
}
We can execute this query using the regular query execution command (either by saving the query string in a file or by passing the query string as a parameter):
stardog query execute music beatles_members.sparql
The result for this query will look like this:
member | name |
---|---|
:Artist1 | John Lennon |
:Artist2 | Paul McCartney |
:Artist3 | Ringo Starr |
:Artist4 | George Harrison |
The IRIs for The Beatles members are automatically generated using the template specified in the mappings which appends the id
field from the Artist
table to the artist namespace.
Suppose we execute the following update query that will store the birth date of the Beatles’ members in Stardog natively:
INSERT DATA {
:Artist1 :birthDate "1940-10-09"^^xsd:date .
:Artist2 :birthDate "1942-06-18"^^xsd:date .
:Artist3 :birthDate "1940-07-07"^^xsd:date .
:Artist4 :birthDate "1943-02-25"^^xsd:date .
}
We can write a query that will combine the data from the virtual graph with the birthdate nodes and edges stored directly in Stardog.
SELECT ?member ?name ?birthDate {
GRAPH <virtual://music> {
?band a :Band ;
:name "The Beatles" ;
:member ?member .
?member :name ?name .
}
?member :birthDate ?birthDate
}
In many cases the information about resources will be combined from multiple virtual and local graphs but there won’t be a shared identifier for the entities in different graphs as each data source uses different local identifiers. In this tutorial we simply reused the identifier from our single virtual graph to assert additional information manually, but in real world applications more sophisticated methods are needed to link entities from different data sources. We will discuss these techniques in a future tutorial.
We can run more complicated queries involving filters or aggregation over virtual graphs and the corresponding elements will be automatically translated to SQL by Stardog behind the scenes. For example, we can compute the number of albums released every year:
SELECT ?year (count(?album) AS ?count)
FROM <virtual://music> {
?album a :Album ;
:date ?date ;
BIND (year(?date) AS ?year)
}
GROUP By ?year
For this query we used the FROM
clause instead of the GRAPH
clause which is another way to query named graphs in SPARQL and works with virtual graphs too.
We can look at the query plan Stardog generates to see the exact SQL generated by Stardog:
$ stardog query explain music albums_by_year.sparql
...
The Query Plan:
From <virtual://music>
Projection(?year, ?count) [#1]
`─ VirtualGraphSql<virtual://music> [#1] {
+─ SELECT YEAR(`release_date`) AS `BIND_year`, COUNT(*) AS `count`
+─ FROM `Album`
+─ WHERE `release_date` IS NOT NULL
+─ GROUP BY YEAR(`release_date`)
}
As the query plan shows, Stardog pushes functions and aggregation to the SQL to execute the query efficiently.
Importing a Virtual Graph
In some cases you need to materialize the information stored in an RDBMS directly into Stardog. In one bulk operation all the contents of the relational data source can be mapped to RDF and added into the Stardog database. In some cases query performance over materialized graphs will be better as the data will be indexed locally by Stardog, but materialization may not be practical in cases where frequency of change is very high.
If the mappings for the virtual graph are already registered, then a regular SPARQL update query can be used for importing the virtual graph:
COPY <virtual://music> TO :importGraph
The COPY
query clears the contents of the target graph :importGraph
first so if the RDBMS contents change over time this command can be used to repopulate the materialized graph in the future. If the import operation will be done only once then the virtual import
command can be used to import the contents of the RDBMS into Stardog without registering the virtual graph first.
Conclusion
Stardog’s Virtual Graph capability provides extreme flexibility by exposing relational data as a graph without moving the data out of the RDBMS. In cases where this is not possible due to performance or quality of service constraints, the same logical mappings used for virtualization can be used to materialize the data in Stardog allowing users to easily switch between two different modes of operation.