Data Sources
This page discusses Data Sources, how they are used by Virtual Graphs, and how to manage them.
Overview
Virtual Graphs use Data Sources both as a source of connections to data resources external to Stardog and as a source of metadata (table names, columns, data types, keys, row counts, etc.) describing those resources.
Managing Data Sources separately from the Virtual Graphs that use them has these advantages:
- Multiple Virtual Graphs can share the same Data Source
- Administrators can manage connections (e.g. max connections limit) per Data Source
- The management of the Data Source and Virtual Graphs can be granted to different security roles
- Data Source metadata can be managed separately from individual Virtual Graphs
Creating and Managing Data Sources
To create a Data Source, register it with the data-source add CLI command:
$ stardog-admin data-source add dept.properties
In the above example we are using the same properties file from the Virtual Graph example. The support for password files applies to Data Sources as well.
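Since the properties file itself is not shown on this page, here is a minimal sketch of what dept.properties might contain. The values are copied from the data-source options output shown further down this page and are assumed to match the Virtual Graph example. The registration step is guarded so the snippet is harmless on a machine without the Stardog CLI.

```shell
#!/bin/sh
# Write the connection properties to dept.properties.
# Values mirror the `data-source options dept` output on this page;
# replace them with your own connection details.
cat > dept.properties <<'EOF'
jdbc.url=jdbc:mysql://localhost/dept
jdbc.username=MySqlUserName
jdbc.password=MyPassword
jdbc.driver=com.mysql.jdbc.Driver
EOF

# Register the Data Source only if the Stardog CLI is on the PATH.
if command -v stardog-admin >/dev/null 2>&1; then
  stardog-admin data-source add dept.properties
fi
```

The Data Source name (dept) appears to be derived from the file name, as the data-source list output on this page suggests.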
Use the data-source list CLI command to view a list of registered Data Sources:
$ stardog-admin data-source list
+--------------------+--------+--------+
| Data Source | Shared | Online |
+--------------------+--------+--------+
| data-source://dept | true | true |
+--------------------+--------+--------+
The data-source options CLI command returns the properties that were used when creating the Data Source:
$ stardog-admin data-source options dept
jdbc.url=jdbc:mysql://localhost/dept
jdbc.username=MySqlUserName
jdbc.password=MyPassword
jdbc.driver=com.mysql.jdbc.Driver
To remove a Data Source, use the data-source remove command:
$ stardog-admin data-source remove dept
Cannot remove data source 'dept' without the force option because it is in use by virtual graph: dept
As illustrated, the data-source remove command will fail if the Data Source is in use by any Virtual Graphs. Running the data-source remove command with the --force option will remove the Data Source as well as all its dependent Virtual Graphs (use with caution):
$ stardog-admin data-source remove --force dept
Successfully removed data source dept
Private and Shared Data Sources
Every Virtual Graph uses one Data Source. While the Stardog APIs allow the creation of a Virtual Graph without explicitly naming a Data Source, when that is done Stardog automatically creates a “private” Data Source with the same local name as the Virtual Graph for the sole use of that Virtual Graph. The life cycle of a private Data Source is tied to the life cycle of its dependent Virtual Graph – when the Virtual Graph is removed the private Data Source is automatically removed as well.
Private Data Sources can be converted to shared Data Sources with the data-source share CLI command:
$ stardog-admin data-source share dept
Stardog Studio always creates shared Data Sources.
Managing Metadata
When a Virtual Graph loads, it interrogates its Data Source for the names of all the visible schemas and tables as well as the names and data types of columns, primary keys and other constraints, and row count estimates for all the tables that are referenced in the mappings of the Virtual Graph. In a large enterprise, this process can take considerable time, so Stardog saves this metadata with the Data Source.
Stardog uses this saved metadata during query translation and optimization. If any of this metadata changes after the Data Source has saved it, those changes will not be visible to the loaded Virtual Graphs. This phenomenon is known as schema drift.
Refreshing Metadata
The data-source refresh-metadata CLI command clears the saved metadata for a Data Source and reloads all its dependent Virtual Graphs with fresh metadata:
$ stardog-admin data-source refresh-metadata dept
Data Sources load metadata lazily, that is, on demand. This keeps the load between Stardog and the external data resources as low as possible.
Metadata caching is currently supported for JDBC Data Sources only.
The refresh-metadata command will not load tables from new databases because seeing them requires updating the sql.schemas Data Source configuration option. Use the data-source add command with the --overwrite option to update the Data Source configuration options.
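The configuration update described above could be scripted roughly as follows. This is a sketch under stated assumptions: set_property and update_schemas are hypothetical helpers, the comma-separated value format for sql.schemas is assumed rather than taken from this page, and the STARDOG_ADMIN variable is an illustrative override that lets you substitute another command (for example echo) for a dry run.

```shell
#!/bin/sh
# set_property FILE KEY VALUE (hypothetical helper)
# Replaces any existing KEY=... line in FILE, or appends one if absent.
set_property() {
  file=$1; key=$2; value=$3
  tmp=$(mktemp)
  # Keep every line whose key (the text before the first '=') differs.
  awk -F= -v k="$key" '$1 != k' "$file" > "$tmp"
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  # Copy back in place so the original file permissions are preserved.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# update_schemas FILE SCHEMAS (hypothetical helper)
# Sets sql.schemas in FILE, then re-registers the Data Source with --overwrite.
update_schemas() {
  set_property "$1" sql.schemas "$2" &&
    ${STARDOG_ADMIN:-stardog-admin} data-source add --overwrite "$1"
}
```

For example, update_schemas dept.properties "dept,hr" would rewrite the file and re-register the Data Source, where dept and hr stand in for your own schema names.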
Refreshing Row-Count Estimates
One element of the metadata saved with a Data Source is more likely to change than the others: row-count estimates. Row-count estimates are expected to change whenever data is inserted into or deleted from a data resource. While this type of change is distinct from general schema drift, it can affect query optimization, so keeping the estimates up to date is important. The data-source refresh-counts CLI command satisfies this requirement:
$ stardog-admin data-source refresh-counts dept
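Because row counts change routinely, it may be convenient to refresh several Data Sources in one pass, for instance from a scheduled job. The following is a minimal sketch: refresh_counts is a hypothetical helper, STARDOG_ADMIN is an illustrative override for dry runs, and the error handling assumes the CLI exits with a non-zero status on failure.

```shell
#!/bin/sh
# refresh_counts NAME... (hypothetical helper)
# Runs `data-source refresh-counts` for each named Data Source and
# reports failures on stderr without stopping the loop.
refresh_counts() {
  for ds in "$@"; do
    ${STARDOG_ADMIN:-stardog-admin} data-source refresh-counts "$ds" ||
      echo "refresh-counts failed for $ds" >&2
  done
}
```

Calling refresh_counts dept hr would refresh both Data Sources in turn, where hr is a placeholder for a second Data Source name.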
Data Source and Virtual Graph Availability
Both Virtual Graphs and Data Sources can encounter errors, either when initially created or when recreated at startup. A Data Source can fail to load as a result of connection problems. Likewise, a Virtual Graph can fail when the metadata for a Data Source is refreshed and the new schema is not compatible with the mappings. When these errors occur, Stardog will mark the resource as “Unavailable”, which is like an offline mode. It provides a way for the resource to continue to appear as a resource while at the same time providing an indication that there is a problem with it.
Once a Data Source or Virtual Graph is marked as unavailable, it will stay unavailable until Stardog is restarted or the data-source online CLI command or the virtual online CLI command, respectively, is run:
$ stardog-admin data-source online dept
$ stardog-admin virtual online dept
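To find which Data Sources need attention, the table printed by data-source list can be filtered on its Online column. The sketch below assumes the table layout shown earlier on this page and assumes that an unavailable Data Source is reported as false in that column. Both offline_sources and bring_online are hypothetical helpers, and STARDOG_ADMIN is an illustrative override.

```shell
#!/bin/sh
# offline_sources (hypothetical helper)
# Reads `data-source list` output on stdin and prints the local name of
# every Data Source whose Online column is "false".
offline_sources() {
  awk -F'|' '
    /^\|/ {                      # data and header rows start with "|"
      name = $2; online = $4
      gsub(/^[ \t]+|[ \t]+$/, "", name)
      gsub(/^[ \t]+|[ \t]+$/, "", online)
      if (online == "false") {
        sub(/^data-source:\/\//, "", name)   # keep only the local name
        print name
      }
    }'
}

# bring_online (hypothetical helper)
# Attempts to bring every offline Data Source back online.
bring_online() {
  ${STARDOG_ADMIN:-stardog-admin} data-source list | offline_sources |
    while read -r ds; do
      ${STARDOG_ADMIN:-stardog-admin} data-source online "$ds"
    done
}
```

The header row is skipped naturally because its Online cell reads "Online" rather than "false", and the border rows are skipped because they begin with "+".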
Chapter Contents
- Supported Data Sources
- Data Source Configuration
- REST Connector Configuration
- Supported Client Drivers
- Specific Data Source Considerations