
Database Administration

This chapter discusses administering Stardog databases; see Chapter Contents for everything included in this chapter. This page covers the basics of managing Stardog databases, such as creating a database.

Page Contents
  1. Creating a Database
    1. Database Creation Templates
    2. Optimizing Bulk Data Loading
    3. Archetypes
      1. Inline Archetypes
      2. Protected Archetypes
      3. Built-in Archetypes
  2. Database Status
  3. Namespaces
  4. Adding Data
  5. Loading Compressed Data
  6. Dropping a Database
  7. Transactions
  8. Chapter Contents

Creating a Database

Stardog databases may be created locally or remotely, but performance is better if data files don’t have to be transferred over a network during creation and initial loading. See the section below about loading compressed data. All data files, indexes, and server metadata for the new database will be stored in Stardog Home.

Stardog won’t create a database with the same name as an existing database. Stardog database names must start with an alpha character followed by zero or more alphanumeric, hyphen, or underscore characters, as given by the regular expression [A-Za-z]{1}[A-Za-z0-9_-]*.
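
For example, myDb, sales-2024, and my_db are valid names, while 1db and _db are not, since they don’t begin with an alpha character.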

Certain reserved words, including system, admin, and docs, may not be used as the names of Stardog databases.

Minimally, the only thing you must know to create a Stardog database is a database name; optionally, you may customize other database parameters and options depending on anticipated workloads, data modeling, and other factors.

See the stardog-admin db create manual page for further details and examples.

Database Creation Templates

As a boon to the overworked admin or devops peeps, Stardog Server supports database creation templates: you can pass a Java Properties file with configuration values set, supplying the values that are unique to a specific database (typically just the database name) as CLI parameters.

EXAMPLE

To create a new database with the default options, simply provide a name and a set of initial datasets to load:

$ stardog-admin db create -n myDb input.ttl another_file.rdf moredata.rdf.gz

Datasets can be loaded later as well.


EXAMPLE

To create (in this case, an empty) database from a template file:

$ stardog-admin db create -c database.properties

At a minimum, the configuration file (database.properties in the above example) must have a value for the database.name option:

database.name = myDb

Configuring the database is discussed in the Database Configuration page.
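
A template is not limited to the database name; it can set any of the configuration options described there. A slightly fuller, purely illustrative database.properties might look like the following, using option keys that appear elsewhere on this page (the values are examples only):

database.name = myDb
icv.enabled = true
database.archetypes = foaf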


EXAMPLE

If you want to change only a few configuration options, you can give the values for these options directly in the CLI args as follows:

$ stardog-admin db create -n myDb -o icv.enabled=true icv.reasoning.enabled=true -- input.ttl

The -- is needed in this case, where -o is the last option, to delimit the values for -o from the files to be bulk loaded.

Optimizing Bulk Data Loading

Stardog tries hard to do bulk loading at database creation time in the most efficient and scalable way possible. Data loading time can vary widely, depending on factors in the data to be loaded, including the number of unique resources, etc. Here are some tuning tips that may work for you:

  1. Use the bulk_load memory configuration for loading large databases (see the Memory Configuration section for further details).
  2. Load compressed data, since compression minimizes disk access.
  3. Use a multicore machine, since bulk loading is highly parallelized and indexes are built concurrently.
  4. Load many files together at creation time, since different files will be parsed and processed concurrently, improving the load speed.
  5. Turn off the strict.parsing database option (see the example after this list).
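
Putting a few of these tips together, a bulk load might look like the following sketch (the file names are placeholders, and strict.parsing is the option key assumed in tip 5):

$ stardog-admin db create -n myDb -o strict.parsing=false -- data1.ttl.gz data2.ttl.gz data3.ttl.gz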

Archetypes

A database archetype is a simple templating mechanism for bundling a set of namespaces, schemas, and constraints to populate a newly created database. Archetypes are an easy way to register the namespaces, reasoning schemas, and constraints of standardized vocabularies and ontologies with a database. Archetypes are composable, so multiple archetypes can be specified at database creation time to load all the defined namespaces, schemas, and constraints into the database. Archetypes are intended to be used alongside your domain data, which may include as many other schemas and constraints as required.

As of Stardog 7.2.0, the preferred way of using archetypes is via the Stardog Archetype Repository which comes with archetypes for FOAF, SKOS, PROV, and CIM. Follow the instructions on the GitHub repository for setting up and using archetypes.

Once the archetypes have been set up, you can use the following command to create a new database that will load the namespaces, schemas, and constraints associated with an archetype:

$ stardog-admin db create -o database.archetypes="cim" -n myDb
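
Because archetypes are composable, more than one can be specified at creation time; a comma-separated list is assumed here as the way to name multiple archetypes:

$ stardog-admin db create -o database.archetypes="foaf,skos" -n myDb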

Inline Archetypes

Archetypes can be used as a predefined way of loading a schema and a set of constraints into the database, just as any RDF data can be loaded into a database. These kinds of archetypes are called “inline” because their contents appear in the database under predefined named graphs, as explained next. The named graphs automatically created by archetypes can be queried and modified by the user like any other named graph.

Each archetype has a unique IRI identifying it and the schema contents of inline archetypes will be loaded into a named graph with that IRI. To see an example, follow the setup instructions to download the archetypes to ${STARDOG_HOME}/.archetypes and create a new database with the FOAF archetype:

$ stardog-admin db create -o database.archetypes="foaf" -n myDb

If you query the database you will see a named graph automatically created:

$ stardog query myDb "select distinct ?g { graph ?g { } }"
+----------------------------+
|             g              |
+----------------------------+
| http://xmlns.com/foaf/0.1/ |
+----------------------------+

Protected Archetypes

Archetypes can also be defined in a “protected” mode where the schema and the constraints will be available for reasoning and validation services but will not be stored in the database. In this mode, archetypes prevent unintended modifications to the schema and the constraints without losing their reasoning and validation functionality. An ontology like PROV is standardized by the W3C and is not meant to change over time, so the protected mode is a good fit for it.

User-defined archetypes are inline by default, but the archetype definition can be configured to make the schema and/or the constraints protected, as explained in the GitHub repository.

The following example shows how using a protected archetype would look:

$ stardog-admin db create -o database.archetypes="prov" -n provDB
Successfully created database 'provDB'.

$ stardog query provDB "select distinct ?g { graph ?g { } }"
+-------+
|   g   |
+-------+
+-------+

$ stardog reasoning schema provDB
prov:wasDerivedFrom a owl:ObjectProperty
prov:wasGeneratedBy owl:propertyChainAxiom (prov:qualifiedGeneration prov:activity)
prov:SoftwareAgent a owl:Class
prov:wasInfluencedBy rdfs:domain (prov:Activity or prov:Agent or prov:Entity)
...

$ stardog query --reasoning provDB "select * { ?cls rdfs:subClassOf prov:Agent }"
+--------------------+
|        cls         |
+--------------------+
| prov:Agent         |
| prov:SoftwareAgent |
| owl:Nothing        |
| prov:Person        |
| prov:Organization  |
+--------------------+

$ stardog icv export provDB
AxiomConstraint{prov:EmptyCollection rdfs:subClassOf (prov:hadMember max 0 owl:Thing)}
AxiomConstraint{prov:Entity owl:disjointWith prov:Derivation}
SPARQLConstraint{
...

This example demonstrates that the database looks empty to regular SPARQL queries, but reasoning queries see the PROV ontology. Similarly, PROV constraints are visible for validation purposes, but they cannot be removed by the icv drop command.

Built-in Archetypes

Before Stardog 7.2.0, the only way to define archetypes was by creating and registering a new Java class that contained the archetype definition. This method is deprecated as of Stardog 7.2.0, but it will continue to work until Stardog 8, at which point support for Java-based archetypes will be removed. Until then, the Java-based PROV and SKOS archetypes that were bundled in the Stardog distribution as built-in archetypes will be available and can be used without setting up the archetype location as described above.

Database Status

Databases are either online or offline; this allows database maintenance to be decoupled from server maintenance.

Databases are put online or offline synchronously: these operations block until other database activity is completed or terminated.

To set a database offline from the CLI:

$ stardog-admin db offline myDb

To set the database online:

$ stardog-admin db online myDb

If the Stardog server is shut down while a database is offline, the database will remain offline when the server restarts.

See the db online and db offline man pages for further information on these CLI commands.

Namespaces

Stardog allows database administrators to persist and manage custom namespace prefix bindings.

At database creation time, if data loaded into the database contains namespace prefix declarations, those prefixes are persisted for the life of the database. This includes setting the default namespace to the default that appears in the file. Any subsequent queries to the database may simply omit the PREFIX declarations:

$ stardog query myDb "select * { ?s rdf:type owl:Class }"

To add new bindings, use the namespace subcommand in the CLI:

$ stardog namespace add myDb --prefix ex --uri 'http://example.org/test#'

To change the default binding, use an empty (quoted) prefix when adding a new one:

$ stardog namespace add myDb --prefix "" --uri http://new.default

To change an existing binding, delete the existing one and then add a new one:

$ stardog namespace remove myDb --prefix ex
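
Then add the prefix again with its new URI (the URI below is only an example):

$ stardog namespace add myDb --prefix ex --uri 'http://example.org/test2#'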

Finally, to see all the existing namespace prefix bindings:

$ stardog namespace list myDb

If no files are used during database creation, or if the files do not define any prefixes (e.g., N-Triples), then the “Big Four” default prefixes are stored: RDF, RDFS, XSD, and OWL.

When executing queries in the CLI, the default table format for SPARQL SELECT results will use the bindings as qnames. SPARQL CONSTRUCT query output (including export) will also use the stored prefixes. To reiterate, namespace prefix bindings are per database, not global.

Adding Data

As mentioned earlier in the Creating a Database section, you can choose to supply data files to load at database creation time. Data can also be added later in a single commit. The data add operation is atomic: if multiple files are being added to the database and there is an error adding one or more of the files, the entire operation will be rolled back.

We can load data into a Stardog database via the command line with the data add command.

EXAMPLE

Add data in the Turtle format to the default graph:

$ stardog data add myDb file.ttl

EXAMPLE

Add data to a specific named graph:

$ stardog data add --named-graph http://example.org/context myDb file.rdf

More examples are provided in the data add manual page.

Loading Compressed Data

Stardog supports loading data from compressed files directly: there’s no need to uncompress files before loading. Loading compressed data is the recommended way to load large input files. Stardog supports GZIP, BZIP2, and ZIP compression natively.

GZIP and BZIP2

A file passed to db create will be treated as compressed if the file name ends with .gz or .bz2. The RDF format of the file is determined by the penultimate extension. For example, if a file named test.ttl.gz is used as input, Stardog will perform GZIP decompression during loading and parse the file with the Turtle parser. All the formats supported by Stardog (RDF/XML, Turtle, TriG, etc.) can be used with compression.

ZIP

The ZIP support works differently since zipped files can contain many files. When an input file name ends with .zip, Stardog performs ZIP decompression and tries to load all the files inside the ZIP file. The RDF format of the files inside the zip is determined by their file names as usual. If there is an unrecognized file extension (e.g. .txt), then that file will be skipped.
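
For example, a new database can be bulk loaded directly from a ZIP archive (the archive name here is a placeholder):

$ stardog-admin db create -n myDb datasets.zip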

Dropping a Database

The db drop command removes a database and all associated files and metadata. This means all files on disk related to the database will be deleted, so only use drop when you’re certain!

It takes as its only argument a valid database name. For example,

$ stardog-admin db drop myDb

Transactions

What follows is specific guidance about Stardog’s transactional semantics and guarantees. Generally speaking, Stardog supports ACID transactions. A good general-purpose discussion of these issues in the context of J2EE is this beginner’s guide.

Atomicity

Databases may guarantee atomicity – groups of database actions (i.e. mutations) are irreducible and indivisible: either all the changes happen or none of them happens. Stardog’s transacted writes are atomic. Stardog does not support nested transactions.

Consistency

Data stored should be valid according to the data model (in this case, RDF) and to the guarantees offered by the database, as well as to any application-specific integrity constraints that may exist. Stardog’s transactions are guaranteed not to violate integrity constraints during execution. A transaction that would leave a database in an inconsistent or invalid state is aborted.

See the Data Quality Constraints section for a more detailed consideration of Stardog’s integrity constraint mechanism.

Isolation

A Stardog connection will run in SNAPSHOT isolation level if it has not started an explicit transaction and will run in SNAPSHOT or SERIALIZABLE isolation level depending on the value of the database configuration option transaction.isolation. In any of these modes, uncommitted changes will only be visible to the connection that made the changes: no other connection can see those values before they are committed. Thus, “dirty reads” can never occur. Additionally, a transaction will only see changes which were committed before the transaction began, so there are no “non-repeatable reads”.

SNAPSHOT isolation does suffer from the write skew anomaly, which poses a problem when operating under external logical constraints. We illustrate this with the following example, where the database initially has two triples :a :val 100 and :b :val 50, and the application imposes the constraint that the total can never be less than 0.

Example of write-skew anomaly:

Time | Connection 1                      | Connection 2                     | Connection 3
-----+-----------------------------------+----------------------------------+--------------------------------
 0   | *BEGIN TX*                        | *BEGIN TX*                       |
 1   | SELECT ?val {:a :val ?val} <= 100 | SELECT ?val {:b :val ?val} <= 50 |
 2   | INSERT {:a :val 0}                | INSERT {:b :val 0}               |
 3   | *COMMIT*                          | *COMMIT*                         |
 4   |                                   |                                  | *BEGIN TX*
 5   |                                   |                                  | SELECT ?val {:a :val ?val} <= 0
 6   |                                   |                                  | SELECT ?val {:b :val ?val} <= 0

At the end of this scenario, Connection 1 believes the state of the database to be :a :val 0 and :b :val 50, so the constraint is not violated. Similarly, Connection 2 believes the state of the database to be :a :val 100 and :b :val 0, which also does not violate the constraint. However, Connection 3 sees :a :val 0 and :b :val 0 which violates the logical constraint.

No locks are taken, nor is any conflict resolution performed, for concurrent transactions in SNAPSHOT isolation level. If there are conflicting changes, the changes of the transaction with the highest commit timestamp (functionally, the transaction which committed “last”) will be what is held in the database. This may yield unexpected results, since every transaction reads from a snapshot that was created at the time the transaction started.

Consider the following query being executed by two concurrent threads in SNAPSHOT isolation level against a database that initially contains the triple :counter :val 1:

DELETE { :counter :val ?oldValue }
INSERT { :counter :val ?newValue }
WHERE  { :counter :val ?oldValue
         BIND (?oldValue+1 AS ?newValue) }

Since each transaction will read the current value from its snapshot, it is possible that both transactions will read the value 1 and insert the value 2 even though we expect the final value to be 3.

Isolation level SERIALIZABLE can be used to avoid these situations. In SERIALIZABLE mode an exclusive lock needs to be acquired before a transaction begins. This ensures concurrent updates cannot interfere with each other, but as a result update throughput will decrease since only one transaction can run at a time.
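
For example, assuming the transaction.isolation option accepts the level names used above as values, the isolation level could be chosen at database creation time like this:

$ stardog-admin db create -n myDb -o transaction.isolation=SERIALIZABLE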

Durability

By default Stardog’s transacted writes are durable and no other actions are required.


Chapter Contents