Data Quality Constraints
This chapter discusses Stardog’s Integrity Constraint Validation (ICV), a feature that enforces data integrity and helps improve the knowledge graph’s correctness and consistency. This page provides an overview and shows the basic usage of this feature. See the Chapter Contents to view what else is included in this chapter.
Overview
Stardog Integrity Constraint Validation (“ICV”) validates RDF data stored in a Stardog database against constraints that users define to suit their domain, application, and data. These constraints may be written in SPARQL, OWL, SWRL, or SHACL.
The use of high-level languages (OWL 2, SWRL, and SPARQL) to validate RDF data using closed world semantics is one of Stardog’s unique capabilities. Using high-level languages like OWL, SWRL, and SPARQL as schema or constraint languages for RDF and Linked Data has several advantages:
- Unifying the domain model with data quality rules
- Aligning the domain model and data quality rules with the integration model and language (i.e., RDF)
- Being able to query the domain model, data quality rules, integration model, mapping rules, etc. with SPARQL
- Being able to use automated reasoning about all of these things to ensure logical consistency, explain errors and problems, etc.
See the extended ICV tutorial on GitHub and our blog post, Data Quality with ICV, for more details about using ICV. SHACL is demonstrated in our Data Validation and SHACL webinar.
CLI
The CLI icv commands can be used to add, remove, or drop all constraints from an existing database. They may also be used to validate an existing database with constraints that are passed into the icv command; that is, using different constraints than the ones already associated with the database.
For a full description of ICV CLI usage, execute stardog help icv and stardog-admin help icv from the command line. Alternatively, see the icv command group pages in the Stardog and Stardog Admin CLI Reference manuals for a full description of all icv commands.
To add constraints to a database:
$ stardog-admin icv add myDb constraints.rdf
To drop all constraints from a database:
$ stardog-admin icv drop myDb
To remove one or more specific constraints from a database:
$ stardog-admin icv remove myDb constraints.rdf
To convert new or existing constraints into SPARQL queries for export:
$ stardog icv convert myDb constraints.rdf
To explain a constraint violation:
$ stardog icv explain --contexts http://example.org/context1 http://example.org/context2 -- myDb
To export constraints:
$ stardog icv export myDb constraints.rdf
To validate a database (or some named graphs) with respect to constraints:
$ stardog icv validate --contexts http://example.org/context1 http://example.org/context2 -- myDb
ICV & OWL 2 Reasoning
An integrity constraint may be satisfied or violated in either of two ways: by an explicit statement in a Stardog database or by a statement that’s been validly inferred by Stardog.
When integrity constraints (ICs) are validated, the user must specify whether reasoning is used. ICV is therefore performed with three inputs:
- a Stardog database,
- a set of constraints, and
- a reasoning flag (which may be, of course, set to false for no reasoning).
This is the case because domain modelers, ontology developers, or integrity constraint authors must consider the interactions between explicit and inferred statements and how these are accounted for in integrity constraints.
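The effect of the reasoning flag can be sketched with a toy model in plain Python. This is not Stardog's API: the `infer_types` and `validate` functions, the tuple-based triples, and the single hard-coded constraint (every `:Employee` must have an `:employeeNum`) are all assumptions made for illustration, and the "reasoner" only handles `rdfs:subClassOf`.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]


def infer_types(triples: Set[Triple]) -> Set[Triple]:
    """Toy reasoner: add rdf:type triples implied by rdfs:subClassOf."""
    inferred = set(triples)
    changed = True
    while changed:  # repeat until no new type triple is derived
        changed = False
        for s, p, o in list(inferred):
            if p != "rdf:type":
                continue
            for sub, q, sup in triples:
                new = (s, "rdf:type", sup)
                if q == "rdfs:subClassOf" and sub == o and new not in inferred:
                    inferred.add(new)
                    changed = True
    return inferred


def validate(triples: Set[Triple], reasoning: bool) -> List[str]:
    """Return the employees that lack an :employeeNum (closed-world check)."""
    data = infer_types(triples) if reasoning else triples
    employees = {s for s, p, o in data if p == "rdf:type" and o == ":Employee"}
    numbered = {s for s, p, o in data if p == ":employeeNum"}
    return sorted(employees - numbered)


# Hypothetical data: :alice is only known to be a :Manager.
DATA = {
    (":Manager", "rdfs:subClassOf", ":Employee"),
    (":alice", "rdf:type", ":Manager"),
    (":bob", "rdf:type", ":Employee"),
    (":bob", ":employeeNum", "42"),
}

print(validate(DATA, reasoning=False))  # []
print(validate(DATA, reasoning=True))   # [':alice']
```

Without reasoning, `:alice` is never seen as an `:Employee`, so the constraint does not apply to her. With reasoning, the inferred type brings her under the constraint and she is reported as a violation, which is why the flag has to be chosen deliberately.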
ICV Guard Mode
Stardog will also apply constraints as part of its transactional cycle and fail transactions that violate constraints. We call this “guard mode”. It must be enabled explicitly in the database configuration options. Using the command line, these steps are as follows:
- Take the database offline.
$ stardog-admin db offline myDb
- Enable ICV with the icv.enabled database configuration option.
$ stardog-admin metadata set -o icv.enabled=true myDb
- Bring the database back online.
$ stardog-admin db online myDb
Once guard mode is enabled, modifications of the database (via SPARQL Update or any other method), whether adds or deletes, that violate the integrity constraints will cause the transaction to fail.
We show how to enable guard mode programmatically in Java at database creation time later in this section.
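The guard mode behavior described above can be pictured as "validate the candidate state, then commit or reject". The following is a minimal in-memory sketch, not Stardog code: `GuardedStore`, `ConstraintViolation`, and `employees_need_number` are hypothetical names, and constraints are modeled as plain functions that return a list of violating subjects.

```python
class ConstraintViolation(Exception):
    """Raised when a transaction would leave the store in an invalid state."""


class GuardedStore:
    def __init__(self, constraints, guard=True):
        self.triples = set()
        self.constraints = constraints  # functions: set of triples -> violations
        self.guard = guard

    def commit(self, adds=(), removes=()):
        # Build the state the transaction would produce, without applying it yet.
        candidate = (self.triples - set(removes)) | set(adds)
        if self.guard:
            violations = [v for check in self.constraints for v in check(candidate)]
            if violations:
                # The transaction fails and the stored triples stay untouched.
                raise ConstraintViolation(violations)
        self.triples = candidate


def employees_need_number(triples):
    """Constraint: every :Employee must have an :employeeNum."""
    employees = {s for s, p, o in triples if p == "rdf:type" and o == ":Employee"}
    numbered = {s for s, p, o in triples if p == ":employeeNum"}
    return sorted(employees - numbered)


store = GuardedStore([employees_need_number])
store.commit(adds=[(":bob", "rdf:type", ":Employee"),
                   (":bob", ":employeeNum", "42")])

try:
    # A delete can violate a constraint just as an add can.
    store.commit(removes=[(":bob", ":employeeNum", "42")])
except ConstraintViolation as err:
    print("rejected:", err)
```

The rejected delete leaves both of `:bob`'s triples in place, mirroring the statement that both adds and deletes that violate the constraints cause the transaction to fail.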
Explaining ICV Violations
ICV violations can be explained using Stardog’s Proof Trees. The following command will explain the IC violations for constraints stored in the database:
$ stardog icv explain --reasoning myDb
The number of violations displayed can be limited, and violations of external constraints can be explained by passing the constraints file as an additional argument:
$ stardog icv explain --reasoning --limit 2 myDb constraints.ttl
Security Note for ICV
There is a security implication in this design that may not be obvious. Changing the reasoning type associated with a database and integrity constraint validation may have serious security implications with respect to a Stardog database and, thus, may only be performed by a user role with sufficient privileges for that action.
Repairing ICV Violations
Stardog has support for automatic repair of some kinds of integrity violations. This can be accomplished programmatically via the API, as well as via the CLI using the icv fix subcommand.
Repair plans are emitted as a sequence of SPARQL Update queries, which means they can be applied to any system that understands SPARQL Update. If you pass --execute, the repair plan will be applied immediately.
The icv fix command will repair violations of all constraints in the database; if you’d prefer to fix the violations for only some constraints, you can pass those constraints as an additional argument. Although a possible (but trivial) fix for any violation is to remove one or more constraints, icv fix does not suggest that kind of repair, even though it may be appropriate in some cases.
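To make the idea of "a repair plan as a sequence of SPARQL Update queries" concrete, here is a small sketch that turns a list of violating subjects into update strings. It does not reproduce the output of icv fix: the `repair_plan` function is hypothetical, and the repair strategy it uses (retracting the type assertion that triggers the constraint) is an assumption chosen only for illustration.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def repair_plan(violating_subjects, cls="http://example.org/Employee"):
    """Return one SPARQL Update string per violating subject.

    Hypothetical strategy: delete the rdf:type triple that makes the
    subject fall under the constraint.
    """
    return [
        f"DELETE DATA {{ <{subject}> <{RDF_TYPE}> <{cls}> }}"
        for subject in violating_subjects
    ]


plan = repair_plan(["http://example.org/alice"])
for update in plan:
    print(update)
```

Because the plan is only text, it can be reviewed before anything is changed and then sent to any SPARQL Update endpoint, which is the portability benefit the paragraph above describes.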
SHACL Constraints
As of version 6.1, Stardog supports validation of SHACL constraints. SHACL constraints can be managed like any other constraint Stardog supports and all the existing validation commands work with SHACL constraints.
Normally constraints are stored in the system database and managed with the special commands icv add and icv remove. This is still possible with SHACL constraints, but if desired, SHACL constraints can be loaded into the database along with regular data using data add. Validation results will be the same in both cases.
SHACL support comes with a new validation command that outputs the SHACL validation report:
$ stardog icv report myDb
SHACL Support Limitations
Stardog supports all the features in the core SHACL Language with the following exceptions:
- Stardog supports SPARQL-based constraints but does not support prebinding the $shapesGraph or $currentShape variables in SPARQL.
- Stardog does not support property validators.
- Stardog does not support the Advanced Features or the JavaScript Extensions.
Constraints Formats
In addition to OWL, ICV constraints can be expressed in SPARQL and Stardog Rules. In both cases, the constraints define queries and rules to find violations. These constraints can be added individually, or defined together in a file as shown below:
@prefix rule: <tag:stardog:api:rule:> .
@prefix icv: <tag:stardog:api:icv:> .

# Rule Constraint
[] a rule:SPARQLRule ;
  rule:content """
    prefix : <http://example.org/>
    IF {
      ?x a :Employee .
    }
    THEN {
      ?x :employeeNum ?number .
    }
  """ .

# SPARQL Constraint
[] a icv:Constraint ;
  icv:query """
    prefix : <http://example.org/>
    select * {
      ?x a :Employee .
      FILTER NOT EXISTS {
        ?x :employeeNum ?number .
      }
    }
  """ .
Terminology
We define some common terminology used in this chapter.
ICV, Integrity Constraint Validation
The process of checking whether some Stardog database is valid with respect to some integrity constraints. The result of ICV is a boolean value (true if valid, false if invalid) and, optionally, an explanation of constraint violations.
Schema, TBox
A schema (or “terminology box”, a.k.a. TBox) is a set of statements that define the relationships between data elements, including property and class names, their relationships, etc. In practical terms, schema statements for a Stardog database are RDF Schema and OWL 2 terms, axioms, and definitions.
Data, ABox
All of the triples in a Stardog database that aren’t part of the schema are part of the data (or “assertional box” a.k.a. ABox).
Integrity Constraint
A declarative expression of some rule or constraint which data must conform to in order to be valid. Integrity Constraints are typically domain and application specific. They can be expressed in OWL 2 (any legal syntax), SWRL rules, or (a restricted form of) SPARQL queries.
Constraints
Constraints that have been associated with a Stardog database and which are used to validate the data it contains. Each Stardog database may optionally have one and only one set of constraints associated with it.
Closed World Assumption, Closed World Reasoning
Stardog ICV assumes a closed world with respect to data and constraints: that is, it assumes that all relevant data is known to it and included in a database to be validated. It interprets the meaning of Integrity Constraints in light of this assumption; if a constraint says a value must be present, the absence of that value is interpreted as a constraint violation and, hence, as invalid data.
Open World Assumption, Open World Reasoning
A legal OWL 2 inference may violate or satisfy an Integrity Constraint in Stardog. In other words, you get to have your cake (OWL as a constraint language) and eat it, too (OWL as modeling or inference language). This means that constraints are applied to a Stardog database with respect to an OWL 2 profile.
Monotonicity
OWL is a monotonic language: that means you can never add anything to a Stardog database that causes there to be fewer legal inferences. Or, put another way, the only way to decrease the number of legal inferences is to delete something.
Monotonicity interacts with ICV in the following ways:
- Adding data to or removing it from a Stardog database may make it invalid.
- Adding schema statements to or removing them from a Stardog database may make it invalid.
- Adding new constraints to a Stardog database may make it invalid.
- Deleting constraints from a Stardog database cannot make it invalid.
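The data and constraint cases in the list above can be checked with a toy model. The sketch below is an assumption-laden illustration, not Stardog behavior: constraints are plain Python functions over a set of triples, the names `violations`, `employees_need_number`, and `at_most_one_number` are hypothetical, and schema changes are not modeled because there is no reasoner here.

```python
def violations(triples, constraints):
    """Collect the violating subjects reported by every constraint."""
    return [v for check in constraints for v in check(triples)]


def employees_need_number(triples):
    """Constraint: every :Employee must have an :employeeNum."""
    employees = {s for s, p, o in triples if p == "rdf:type" and o == ":Employee"}
    numbered = {s for s, p, o in triples if p == ":employeeNum"}
    return sorted(employees - numbered)


def at_most_one_number(triples):
    """Constraint: no subject may have more than one :employeeNum."""
    counts = {}
    for s, p, o in triples:
        if p == ":employeeNum":
            counts[s] = counts.get(s, 0) + 1
    return sorted(s for s, n in counts.items() if n > 1)


base = {(":bob", "rdf:type", ":Employee"), (":bob", ":employeeNum", "42")}
constraints = [employees_need_number]

# Adding or removing data can each introduce a violation.
added = base | {(":alice", "rdf:type", ":Employee")}
removed = base - {(":bob", ":employeeNum", "42")}
print(violations(added, constraints))    # [':alice']
print(violations(removed, constraints))  # [':bob']

# Adding a constraint can invalidate data that was valid before.
two_numbers = base | {(":bob", ":employeeNum", "43")}
print(violations(two_numbers, constraints))                         # []
print(violations(two_numbers, constraints + [at_most_one_number]))  # [':bob']

# Deleting constraints can only shrink the set of violations.
print(violations(two_numbers, []))  # []
```

The last case holds by construction: with fewer checks to run, the list of violations can only stay the same or get shorter, so removing a constraint never turns a valid database into an invalid one.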