Data Quality Constraints
This chapter discusses Stardog’s Integrity Constraint Validation (ICV) - a feature to enforce data integrity and help improve the knowledge graph’s correctness and consistency. This page provides an overview and shows you the basic usage of this feature. See the Chapter Contents to view what else is included in this chapter.
Overview
Stardog Integrity Constraint Validation (“ICV”) validates RDF data stored in a Stardog database against constraints that users define to suit their domain, application, and data. These constraints are written in SHACL (Shapes Constraint Language). Using a high-level language as a constraint language for RDF and Linked Data has several advantages:
- Unifying the domain model with data quality rules
- Aligning the domain model and data quality rules with the integration model and language (i.e., RDF)
- Being able to query the domain model, data quality rules, integration model, mapping rules, etc. with SPARQL
- Being able to use automated reasoning about all of these things to ensure logical consistency, explain errors and problems, etc.
Typical ICV usage is to add constraints to a database that holds the domain data and then validate the database to see if there are any violations. It is also possible to enable guard mode, which enforces the constraints at database modification time.
SHACL Constraints
SHACL and other data quality concepts are demonstrated in our Data Quality Training. You can find an illustrative example of SHACL constraints for the music tutorial dataset in the tutorials repo.
Adding Constraints
SHACL is expressed as RDF, so SHACL constraints can be added to a Stardog database just like any other RDF data. The best practice is to store SHACL definitions in one or more named graphs to make them easier to manage.
$ stardog data add -g urn:example:constraints stardog-tutorial-music stardog-tutorials/shacl/music_shacl.ttl
By default, the validation process uses any SHACL definition in any named graph, but the database configuration option shacl.shape.graphs can be set to a list of named graphs to restrict which named graphs are used to look up the constraints. The shape graphs can also be specified at validation time, as explained below.
When SHACL constraints are stored in a named graph, clearing the named graph will remove the constraints from the database:
$ stardog data remove -g urn:example:constraints stardog-tutorial-music
The SHACL constraints in the database can be queried with SPARQL like regular data. The icv export command can also be used to show the list of SHACL constraints:
$ stardog icv export stardog-tutorial-music
ShaclConstraint{http://stardog.com/tutorial/SongShape}
ShaclConstraint{http://stardog.com/tutorial/AlbumShape}
ShaclConstraint{http://stardog.com/tutorial/ArtistShape}
ShaclConstraint{http://stardog.com/tutorial/BandShape}
If the -f/--format option is used, the contents of the constraints can be exported in any desired RDF format. For example, the following command will export the constraints in pretty Turtle format:
$ stardog icv export -f pretty stardog-tutorial-music
Validate SPARQL query
Validation is the process of checking whether a database is valid with respect to the integrity constraints. The result is a validation report, which is a collection of violations. If there are no violations then we say the database conforms to the constraints, or in other words it is valid. Each violation points to a node and a constraint along with other auxiliary information to explain what has been violated. The validation report can be retrieved by executing a VALIDATE query. The VALIDATE query is a new top-level query form, i.e. separate from SELECT, CONSTRUCT, or other query types introduced by Stardog. The VALIDATE query returns an RDF graph as its result, similar to CONSTRUCT and DESCRIBE queries. The result is a SHACL validation report as defined in the SHACL specification.
The syntax of VALIDATE queries is as follows:
VALIDATE (ALL | [<IRI>+] [GRAPH <IRI>+])
[USING SHAPES (<IRI>+ | GRAPH <IRI>+ | <QuadData>) ]
[LIMIT <int>]
[LIMIT PER SHAPE <int>]
where IRI and QuadData are defined in the SPARQL grammar.
The details of VALIDATE queries are explained in the following sections.
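The grammar above can be assembled programmatically. The following is a minimal Python sketch, not part of Stardog: build_validate_query and its parameter names are hypothetical, and the helper only concatenates the clauses in the order the grammar lists them. It leaves out the inline QuadData form of USING SHAPES and does no IRI validation or escaping.

```python
from typing import Optional, Sequence


def build_validate_query(
    nodes: Sequence[str] = (),
    graphs: Sequence[str] = (),
    shapes: Sequence[str] = (),
    shape_graphs: Sequence[str] = (),
    limit: Optional[int] = None,
    limit_per_shape: Optional[int] = None,
) -> str:
    """Assemble a VALIDATE query string following the documented grammar.

    IRIs are passed through as written (e.g. "ex:graph1" or "<urn:g>"),
    so any prefixes they rely on must be declared by the caller.
    """
    if shapes and shape_graphs:
        raise ValueError("USING SHAPES takes shape IRIs or shape graphs, not both")

    # Target part: ALL, or optional node IRIs followed by optional GRAPH IRIs.
    if not nodes and not graphs:
        target = "ALL"
    else:
        parts = list(nodes)
        if graphs:
            parts += ["GRAPH", *graphs]
        target = " ".join(parts)

    clauses = [f"VALIDATE {target}"]
    if shapes:
        clauses.append("USING SHAPES " + " ".join(shapes))
    elif shape_graphs:
        clauses.append("USING SHAPES GRAPH " + " ".join(shape_graphs))
    if limit is not None:
        clauses.append(f"LIMIT {limit}")
    if limit_per_shape is not None:
        clauses.append(f"LIMIT PER SHAPE {limit_per_shape}")
    return " ".join(clauses)


print(build_validate_query())
print(build_validate_query(graphs=["ex:graph1", "ex:graph2"]))
print(build_validate_query(shapes=[":SongShape", ":AlbumShape"], limit=100, limit_per_shape=1))
```

The resulting string can be passed to the stardog query command in the same way as any other query text.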
Validating Constraints
The VALIDATE query in its simplest form looks as follows:
VALIDATE ALL
This query validates all the named graphs in the database using all the constraints stored within the database. More complex forms of the query can validate different subsets of the data using a subset of the constraints, as explained below.
For a valid database the result of this query is a report that looks as follows:
@prefix sh: <http://www.w3.org/ns/shacl#> .
[
a sh:ValidationReport ;
sh:conforms true
] .
An example validation report showing some violations looks as follows:
@prefix : <http://stardog.com/tutorial/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
[
a sh:ValidationReport ;
sh:conforms false ;
sh:result [
a sh:ValidationResult ;
sh:resultSeverity sh:Violation ;
sh:sourceShape :SongLengthShape ;
sh:sourceConstraintComponent sh:DatatypeConstraintComponent ;
sh:focusNode :Love_Me_Do ;
sh:resultPath :length ;
sh:value 125.0 ;
sh:resultMessage "Value must have datatype xsd:integer"
] , [
a sh:ValidationResult ;
sh:resultSeverity sh:Violation ;
sh:sourceShape :AlbumDateShape ;
sh:sourceConstraintComponent sh:MaxCountConstraintComponent ;
sh:focusNode :Please_Please_Me ;
sh:resultPath :date ;
sh:resultMessage "There must be <= 1 values"
] , [
a sh:ValidationResult ;
sh:resultSeverity sh:Violation ;
sh:sourceShape :AlbumTrackShape ;
sh:sourceConstraintComponent sh:MinCountConstraintComponent ;
sh:focusNode :McCartney ;
sh:resultPath :track ;
sh:resultMessage "There must be >= 1 values"
]
] .
Validating Specific Graphs
By default, validation works over the whole database and validates the contents of every named graph. Validation can be performed over one or more specific named graphs:
VALIDATE GRAPH ex:graph1 ex:graph2
Validating specific graphs means that any node outside these graphs will not be considered a target of the SHACL constraints being used for validation.
If a node violating a shape is stored in multiple named graphs then validation results for that node will be duplicated for each occurrence of the node.
Validating Specific Nodes
The focus of validation can be made as fine-grained as one or more specific nodes in the graph:
VALIDATE ex:MyNode
In this case, only the specified node(s) will be validated using the applicable constraints. Any other node will not be considered as a target. And any constraint for which the specified node(s) are not targets will be ignored. If the specified node IRIs are not found in the database then a violation will be returned for each such undefined node.
Validating Shapes from Specific Graphs
By default, the VALIDATE query will use any SHACL constraint in any named graph, but the database configuration option shacl.shape.graphs can be set to a list of named graphs to restrict which named graphs are used to look up the constraints. It is also possible to define the shapes graph within the VALIDATE query:
VALIDATE ALL USING SHAPES GRAPH ex:constraintGraph
This query will use any shape defined within the specified graph(s).
If the same shape is stored in multiple named graphs then validation using those shape graphs will duplicate the validation results for that shape, once for each occurrence of the shape.
Validating Specific Shapes
The shapes to validate can be directly specified for the validation command too:
VALIDATE ALL USING SHAPES :SongShape :AlbumShape
In this case no other shape will be validated. If the specified shape IRIs are not found in the database then a violation will be returned for each such undefined shape.
Validating External Shapes
It is also possible to perform the validation using shapes that are not stored within the database. In this case, the shapes can be specified inline with the VALIDATE query:
VALIDATE ALL USING SHAPES {
:NameShape a sh:NodeShape ;
sh:property [
sh:path :name ;
sh:minCount 1 ;
sh:datatype xsd:string
] .
}
Limiting Number of Violations
The number of violations might be too high for some databases where it is desirable to limit the number of violations included in the validation report for performance or readability reasons. A limit can be defined to stop validation when a certain number of violations have been found:
VALIDATE ALL LIMIT 10
In some cases, it is desirable to limit the violations reported per shape. This is useful to get a quick summary of all the shapes for which there are violations. The LIMIT PER SHAPE clause can be used to limit the violations reported per shape. The following query will return at most one violation per shape:
VALIDATE ALL LIMIT PER SHAPE 1
For example, if there are three shapes for which there are violations then this query will return 3 results. LIMIT PER SHAPE can be used in conjunction with the query LIMIT as well:
VALIDATE ALL LIMIT 100 LIMIT PER SHAPE 1
This query would return at most one validation result per shape, but if there are more than 100 shapes with validation results, validation would stop after the first 100 results have been returned.
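To make the interaction between the two limits concrete, here is a small Python model of the behaviour described above. It is an illustration only and not Stardog's implementation: apply_limits and the sample violation list are hypothetical, and the order in which Stardog discovers violations is not specified here.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple

Violation = Tuple[str, str]  # (source shape, focus node)


def apply_limits(
    violations: Iterable[Violation],
    limit: Optional[int] = None,
    limit_per_shape: Optional[int] = None,
) -> List[Violation]:
    """Model how LIMIT and LIMIT PER SHAPE cap a stream of violations."""
    kept: List[Violation] = []
    per_shape: Counter = Counter()
    for shape, focus in violations:
        if limit is not None and len(kept) >= limit:
            break  # overall LIMIT reached: stop validating
        if limit_per_shape is not None and per_shape[shape] >= limit_per_shape:
            continue  # this shape already has enough reported violations
        per_shape[shape] += 1
        kept.append((shape, focus))
    return kept


# Hypothetical violations: three shapes, five violations in total.
VIOLATIONS = [
    (":SongLengthShape", ":Love_Me_Do"),
    (":SongLengthShape", ":Yesterday"),
    (":AlbumDateShape", ":Please_Please_Me"),
    (":AlbumTrackShape", ":McCartney"),
    (":AlbumTrackShape", ":Imagine"),
]

# Corresponds to: VALIDATE ALL LIMIT PER SHAPE 1
print(apply_limits(VIOLATIONS, limit_per_shape=1))
# Corresponds to: VALIDATE ALL LIMIT 2 LIMIT PER SHAPE 1
print(apply_limits(VIOLATIONS, limit=2, limit_per_shape=1))
```

With three violating shapes, the first call keeps one violation per shape (three results), while the second stops as soon as two results have been collected.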
Validate SPARQL service
In addition to the VALIDATE query explained above, Stardog supports a way to perform validation using the SERVICE keyword. This allows validation to be done as part of a SELECT query. Each solution returned by the validate service corresponds to a distinct violation result. If there are no violations then the service will return no solutions.
The following example shows how the service can be invoked:
PREFIX icv: <tag:stardog:api:icv:>
SELECT * {
SERVICE icv:validate {
# service input parameters
[] icv:dataGraph :myDataGraph;
icv:shapesGraph :myShapesGraph;
# service output parameters
icv:resultSeverity ?severity ;
icv:resultMessage ?message ;
icv:sourceShape ?shape ;
icv:sourceConstraint ?constraint ;
icv:sourceConstraintComponent ?component ;
icv:focusNode ?focusNode ;
icv:resultPath ?path ;
icv:value ?valueNode ;
}
}
Similar to VALIDATE queries, data graphs and shapes graphs can be specified for the validation process. If no input parameters are given then the validation will run over the whole database using all the constraints. The service supports only constant input parameters; that is, input parameters cannot be specified as variables that will be bound by other parts of the query.
An example result of the SERVICE looks as follows:
+--------------------------------------+------------------------------------------------------------------------+----------------------------------------------+------------+--------------------------------------------------------+----------------------------------------------+------------------------------------+--------------------------------------------+
| severity | message | shape | constraint | component | focusNode | path | valueNode |
+--------------------------------------+------------------------------------------------------------------------+----------------------------------------------+------------+--------------------------------------------------------+----------------------------------------------+------------------------------------+--------------------------------------------+
| http://www.w3.org/ns/shacl#Violation | "Value must have datatype xsd:integer" | http://stardog.com/tutorial/SongLengthShape | | http://www.w3.org/ns/shacl#DatatypeConstraintComponent | http://stardog.com/tutorial/Love_Me_Do | http://stardog.com/tutorial/length | 1.2E3 |
| http://www.w3.org/ns/shacl#Violation | "There must be >= 1 values" | http://stardog.com/tutorial/AlbumTrackShape | | http://www.w3.org/ns/shacl#MinCountConstraintComponent | http://stardog.com/tutorial/McCartney | http://stardog.com/tutorial/track | |
| http://www.w3.org/ns/shacl#Violation | "There must be <= 1 values" | http://stardog.com/tutorial/AlbumDateShape | | http://www.w3.org/ns/shacl#MaxCountConstraintComponent | http://stardog.com/tutorial/Imagine | http://stardog.com/tutorial/date | |
+--------------------------------------+------------------------------------------------------------------------+----------------------------------------------+------------+--------------------------------------------------------+----------------------------------------------+------------------------------------+--------------------------------------------+
Complex SHACL property paths are serialized as multiple triples in RDF. However, since the resultPath in the SPARQL service is bound to a single RDF value, this complexity cannot be expressed. If the result path in a violation is a predicate path then the resulting variable will be bound to the corresponding IRI. If the result path is a complex property path then the variable will be bound to the string representation of the path, e.g. ex:firstProperty/ex:secondProperty*.
The validate service can be used to validate external constraints by providing the shapes inline within the SERVICE block.
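Because the service returns ordinary SELECT solutions, the output is easy to post-process on the client. The sketch below assumes the results were retrieved in the standard SPARQL 1.1 Query Results JSON format; the helper names violations_per_shape and conforms are hypothetical, and the sample document reproduces only a few columns of the example result above.

```python
import json
from collections import Counter

# A trimmed sample in SPARQL 1.1 Query Results JSON format, based on the
# example table above (only some of the columns are reproduced).
SAMPLE_RESULTS = """
{
  "head": {"vars": ["severity", "message", "shape", "focusNode"]},
  "results": {"bindings": [
    {"severity": {"type": "uri", "value": "http://www.w3.org/ns/shacl#Violation"},
     "message": {"type": "literal", "value": "Value must have datatype xsd:integer"},
     "shape": {"type": "uri", "value": "http://stardog.com/tutorial/SongLengthShape"},
     "focusNode": {"type": "uri", "value": "http://stardog.com/tutorial/Love_Me_Do"}},
    {"severity": {"type": "uri", "value": "http://www.w3.org/ns/shacl#Violation"},
     "message": {"type": "literal", "value": "There must be >= 1 values"},
     "shape": {"type": "uri", "value": "http://stardog.com/tutorial/AlbumTrackShape"},
     "focusNode": {"type": "uri", "value": "http://stardog.com/tutorial/McCartney"}},
    {"severity": {"type": "uri", "value": "http://www.w3.org/ns/shacl#Violation"},
     "message": {"type": "literal", "value": "There must be <= 1 values"},
     "shape": {"type": "uri", "value": "http://stardog.com/tutorial/AlbumDateShape"},
     "focusNode": {"type": "uri", "value": "http://stardog.com/tutorial/Imagine"}}
  ]}
}
"""


def violations_per_shape(results_json: str) -> Counter:
    """Count validation results per source shape."""
    doc = json.loads(results_json)
    counts: Counter = Counter()
    for binding in doc["results"]["bindings"]:
        shape = binding.get("shape")  # unbound variables are simply absent
        counts[shape["value"] if shape else "(no shape)"] += 1
    return counts


def conforms(results_json: str) -> bool:
    """No solutions from the validate service means no violations."""
    return not json.loads(results_json)["results"]["bindings"]


for shape, count in violations_per_shape(SAMPLE_RESULTS).items():
    print(shape, count)
print("conforms:", conforms(SAMPLE_RESULTS))
```

The same grouping could also be done inside the SELECT query itself; doing it on the client keeps the query identical to the example above.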
Relationship between the VALIDATE query and VALIDATE service
The VALIDATE query form and the validate SPARQL service provide two different ways to retrieve the validation results. The VALIDATE query can be thought of as syntactic sugar for a CONSTRUCT query using the SPARQL service:
PREFIX icv: <tag:stardog:api:icv:>
PREFIX sh: <http://www.w3.org/ns/shacl#>
CONSTRUCT {
?report a sh:ValidationReport ;
sh:conforms ?conforms ;
sh:result ?result .
?result
a sh:ValidationResult ;
sh:resultSeverity ?severity ;
sh:resultMessage ?message ;
sh:sourceShape ?shape ;
sh:sourceConstraint ?constraint ;
sh:sourceConstraintComponent ?component ;
sh:focusNode ?focusNode ;
sh:resultPath ?path ;
sh:value ?valueNode ;
}
WHERE {
BIND(bnode("_:ValidationReport") as ?report)
OPTIONAL {
BIND(bnode() as ?result)
SERVICE icv:validate {
_:serviceParams icv:dataGraph :staging;
icv:resultSeverity ?severity ;
icv:resultMessage ?message ;
icv:sourceShape ?shape ;
icv:sourceConstraint ?constraint ;
icv:sourceConstraintComponent ?component ;
icv:focusNode ?focusNode ;
icv:resultPath ?path ;
icv:value ?valueNode ;
}
}
BIND(!bound(?result) as ?conforms)
}
This equivalence is not exact: as explained in the Validate SPARQL service section above, the service cannot fully represent complex property paths.
ICV & Reasoning
An integrity constraint may be satisfied or violated in either of two ways: by an explicit statement in a Stardog database or by a statement that’s been validly inferred by Stardog. For this reason, the validation results will change if reasoning is enabled or disabled. By default, reasoning is disabled for validation but can be enabled just as for any other SPARQL query. For example, in the CLI, you can use the -r, --reasoning option:
$ stardog query --reasoning testdb "VALIDATE ALL"
ICV Guard Mode
Stardog will also apply constraints as part of its transactional cycle and fail transactions that violate constraints. We call this “guard mode”. It must be enabled explicitly in the database configuration options. Using the command line, these steps are as follows:
- Take the database offline.
$ stardog-admin db offline myDb
- Enable ICV with the icv.enabled database configuration option.
$ stardog-admin metadata set -o icv.enabled=true myDb
- Bring the database back online.
$ stardog-admin db online myDb
Once guard mode is enabled, modifications of the database (via SPARQL Update or any other method), whether adds or deletes, that violate the integrity constraints will cause the transaction to fail.
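The three steps above can be scripted. The following Python sketch simply wraps the commands shown in this section using the standard subprocess module; the function names are hypothetical, and it assumes stardog-admin is on the PATH and that the server is reachable with default connection settings.

```python
import shlex
import subprocess
from typing import List


def guard_mode_commands(database: str) -> List[List[str]]:
    """Return the CLI commands from the steps above, in order."""
    return [
        ["stardog-admin", "db", "offline", database],
        ["stardog-admin", "metadata", "set", "-o", "icv.enabled=true", database],
        ["stardog-admin", "db", "online", database],
    ]


def enable_guard_mode(database: str) -> None:
    """Run the commands one by one, stopping at the first failure."""
    for command in guard_mode_commands(database):
        # check=True raises CalledProcessError if the command exits non-zero.
        subprocess.run(command, check=True)


# Print the commands without executing them.
for command in guard_mode_commands("myDb"):
    print(shlex.join(command))
```

Note that if the second command fails, the database is left offline, so a real script would probably want to bring it back online in an error handler.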
SHACL Extensions in Stardog
This section discusses the SHACL extensions in Stardog that are not covered by the SHACL standard.
Query Dataset Specification in SPARQL constraints
SPARQL Constraints in SHACL are supported by Stardog. However, the only way to define the dataset for constraint queries is to put FROM or FROM NAMED clauses directly in the query, which is not always convenient. To address this need, Stardog introduces a non-standard SHACL property as an extension as of version 7.9, namely tag:stardog:api:shacl:fromNamed.
As an example, assume that we have data in both staging and production graphs (called :stagingGraph and :productionGraph, respectively). The graphs should not have any matching data about the same node. Thus, we’d like to validate :stagingGraph against :productionGraph within the constraint query, in such a way that only nodes in :stagingGraph are validated (i.e. are target nodes for the shape) while the constraint query is executed against :productionGraph:
# The shape using the SPARQL Constraint below
:DepartmentShape
rdf:type sh:NodeShape ;
sh:targetClass :Department ;
sh:sparql :OneDirectorOnly-sparql
.
# The SPARQL Constraint with the custom extension
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix stardog: <http://api.stardog.com/> .
@prefix stardogSh: <tag:stardog:api:shacl:> .
:OneDirectorOnly-sparql
rdf:type sh:SPARQLConstraint ;
sh:message "a department can only have one director." ;
stardogSh:fromNamed :productionGraph ;
sh:select """
prefix : <http://api.stardog.com/>
SELECT ?this
WHERE {
# this is evaluated against the staging graph
?this :director ?director .
GRAPH ?g {
# this is evaluated against the production graph
?this :director ?anotherDirector .
}
FILTER ( ?director != ?anotherDirector )
}
""" ;
.
Imagine trying to accomplish the example task above, constraining data matches between two graphs, using only the existing --named-graphs option of ICV report. Such a constraint would be applicable only to departments in :stagingGraph, since ?this would be pre-bound to departments in :stagingGraph when that graph is passed to --named-graphs. However, the new stardogSh:fromNamed property now makes the constraint query (with pre-bound ?this) run against :productionGraph, and that query should not find a matching instance.
With the introduction of this extension, it is necessary to address the priority between the given --named-graphs, the FROM part of the query, and the FROM NAMED part of the query when constructing the query dataset for a SPARQL constraint:
- If stardogSh:fromNamed is provided, it takes precedence over both FROM NAMED in the query and --named-graphs to define the named part of the query dataset.
- --named-graphs takes precedence over FROM NAMED graphs in the query.
- stardogSh:fromNamed has no effect on the default part of the query dataset. It’s defined by --named-graphs (if provided) or FROM in the query. If none are specified, it’s all local graphs (including the default graph).
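The precedence rules above can be restated as a small decision function. This is a sketch of one reading of those rules, not Stardog code: resolve_dataset, Dataset, and the ALL_LOCAL_GRAPHS marker are hypothetical names, the default part is assumed to prefer --named-graphs over FROM, and the named part is left as None when no source specifies it because the rules above do not cover that case.

```python
from typing import List, NamedTuple, Optional, Sequence

# Placeholder standing in for "all local graphs (including the default graph)".
ALL_LOCAL_GRAPHS = "<all local graphs>"


class Dataset(NamedTuple):
    default_part: List[str]
    named_part: Optional[List[str]]  # None: not determined by the rules above


def resolve_dataset(
    shape_from_named: Sequence[str] = (),   # stardogSh:fromNamed on the constraint
    cli_named_graphs: Sequence[str] = (),   # --named-graphs
    query_from: Sequence[str] = (),         # FROM in the constraint query
    query_from_named: Sequence[str] = (),   # FROM NAMED in the constraint query
) -> Dataset:
    # Named part: stardogSh:fromNamed > --named-graphs > FROM NAMED.
    if shape_from_named:
        named: Optional[List[str]] = list(shape_from_named)
    elif cli_named_graphs:
        named = list(cli_named_graphs)
    elif query_from_named:
        named = list(query_from_named)
    else:
        named = None

    # Default part: never affected by stardogSh:fromNamed.
    if cli_named_graphs:
        default = list(cli_named_graphs)
    elif query_from:
        default = list(query_from)
    else:
        default = [ALL_LOCAL_GRAPHS]

    return Dataset(default_part=default, named_part=named)


# The staging/production scenario from this section.
print(resolve_dataset(shape_from_named=[":productionGraph"],
                      cli_named_graphs=[":stagingGraph"]))
```

In the staging/production scenario, the default part resolves to :stagingGraph and the named part to :productionGraph, which matches the comments in the constraint query above.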
SHACL Support Limitations
Stardog supports all the features in the core SHACL Language with the following exceptions:
- Stardog supports SPARQL-based constraints but does not support prebinding the $shapesGraph or $currentShape variables in SPARQL.
- Stardog does not support property validators.
- Stardog does not support the Advanced Features or the JavaScript Extensions.