
Data Quality Constraints

This chapter discusses Stardog’s Integrity Constraint Validation (ICV), a feature that enforces data integrity and helps improve the knowledge graph’s correctness and consistency. This page provides an overview and shows the basic usage of this feature. See the Chapter Contents to view what else is included in this chapter.

Page Contents
  1. Overview
  2. SHACL Constraints
  3. Adding Constraints
  4. Validating Constraints
    1. Validating Specific Graphs
    2. Validating Specific Nodes
    3. Validating Shapes from Specific Graphs
    4. Validating Specific Shapes
    5. Limiting Number of Violations
  5. ICV & Reasoning
  6. ICV Guard Mode
  7. SHACL Extensions in Stardog
    1. Query Dataset Specification in SPARQL constraints
  8. SHACL Support Limitations
  9. Chapter Contents

Overview

Stardog Integrity Constraint Validation (“ICV”) validates RDF data stored in a Stardog database against constraints that users define to suit their domain, application, and data. These constraints are written in SHACL (Shapes Constraint Language). Using a high-level language as a constraint language for RDF and Linked Data has several advantages:

  • Unifying the domain model with data quality rules
  • Aligning the domain model and data quality rules with the integration model and language (i.e., RDF)
  • Being able to query the domain model, data quality rules, integration model, mapping rules, etc. with SPARQL
  • Being able to use automated reasoning about all of these things to ensure logical consistency, explain errors and problems, etc.

Typical ICV usage is to add constraints to a database that has the domain data and validate the database to see if there are any violations. It is also possible to enable the guard mode that will enforce the constraints at database modification time.

SHACL Constraints

SHACL and other data quality concepts are demonstrated in our Data Quality Training. You can find an illustrative example of SHACL constraints for the music tutorial dataset in the tutorials repo.

Adding Constraints

SHACL is expressed as RDF so SHACL constraints can be added to a Stardog database just as any other RDF data. Best practice is to store SHACL definitions in one or more named graphs to make the management easier.

$ stardog data add -g urn:example:constraints stardog-tutorial-music stardog-tutorials/shacl/music_shacl.ttl
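As a rough illustration of what such a file might contain, the sketch below writes a small SHACL shape to a Turtle file. The shape name :ExampleSongShape, the class :Song, and the file name example_shape.ttl are assumptions made for this example and are not taken from the tutorial's music_shacl.ttl; the :length property and its xsd:integer datatype mirror the sample validation report shown later on this page.

```shell
# Write a hypothetical SHACL shape (not the tutorial's actual file) to disk.
cat > example_shape.ttl <<'EOF'
@prefix :    <http://stardog.com/tutorial/> .
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Every :Song may have at most one :length, and it must be an xsd:integer.
:ExampleSongShape a sh:NodeShape ;
    sh:targetClass :Song ;
    sh:property [
        sh:path :length ;
        sh:datatype xsd:integer ;
        sh:maxCount 1
    ] .
EOF
```

The resulting file could then be loaded into a constraints graph with the same data add command shown above, substituting example_shape.ttl for the file path.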

By default, the validation process uses any SHACL definition in any named graph, but the database configuration option shacl.shape.graphs can be set to a list of named graphs to restrict where constraints are looked up. The shape graphs can also be specified at validation time, as explained below.

When SHACL constraints are stored in a named graph, clearing the named graph will remove the constraints from the database:

$ stardog data remove -g urn:example:constraints stardog-tutorial-music

The SHACL constraints in the database can be queried with SPARQL like regular data. The icv export command can also be used to show the list of SHACL constraints:

$ stardog icv export stardog-tutorial-music
ShaclConstraint{http://stardog.com/tutorial/SongShape}
ShaclConstraint{http://stardog.com/tutorial/AlbumShape}
ShaclConstraint{http://stardog.com/tutorial/ArtistShape}
ShaclConstraint{http://stardog.com/tutorial/BandShape}

If the -f/--format option is used, the contents of the constraints can be exported in any desired RDF format, e.g. the following command will export the constraints in pretty Turtle format:

$ stardog icv export -f pretty stardog-tutorial-music
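Because the constraints are ordinary RDF, a query along the following lines could list them. This is only a sketch: it assumes the shapes live in named graphs and are explicitly typed as sh:NodeShape, the file name list_shapes.rq is made up for the example, and the invocation in the comment assumes a Stardog CLI version in which query execute accepts a query file after the database name.

```shell
# Save a SPARQL query that lists explicitly typed node shapes per named graph.
cat > list_shapes.rq <<'EOF'
PREFIX sh: <http://www.w3.org/ns/shacl#>

SELECT ?graph ?shape
WHERE {
    GRAPH ?graph { ?shape a sh:NodeShape }
}
ORDER BY ?graph ?shape
EOF

# Possible invocation (needs a running Stardog server, so it is left as a comment):
#   stardog query execute stardog-tutorial-music list_shapes.rq
```

Shapes that are declared without an explicit sh:NodeShape type would not be returned by this query, so icv export remains the more reliable listing.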

Validating Constraints

Validation is the process of checking whether a database is valid with respect to the integrity constraints. The result is a validation report which is a collection of violations. If there are no violations then we say the database conforms to the constraints, or in other words it is valid. Each violation points to a node and a constraint along with other auxiliary information to explain what has been violated. The validation report can be retrieved with the following command:

$ stardog icv report testdb

For a valid database the report will look as follows:

@prefix sh: <http://www.w3.org/ns/shacl#> .

[
a sh:ValidationReport ;
sh:conforms true
] .

An example validation report showing some violations looks as follows:

@prefix : <http://stardog.com/tutorial/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .


[
    a sh:ValidationReport ;
    sh:conforms false ;
    sh:result [
        a sh:ValidationResult ;
        sh:resultSeverity sh:Violation ;
        sh:sourceShape :SongLengthShape ;
        sh:sourceConstraintComponent sh:DatatypeConstraintComponent ;
        sh:focusNode :Love_Me_Do ;
        sh:resultPath :length ;
        sh:value 125.0 ;
        sh:resultMessage "Value must have datatype xsd:integer"
    ] , [
        a sh:ValidationResult ;
        sh:resultSeverity sh:Violation ;
        sh:sourceShape :AlbumDateShape ;
        sh:sourceConstraintComponent sh:MaxCountConstraintComponent ;
        sh:focusNode :Please_Please_Me ;
        sh:resultPath :date ;
        sh:resultMessage "There must be <= 1 values"
    ] , [
        a sh:ValidationResult ;
        sh:resultSeverity sh:Violation ;
        sh:sourceShape :AlbumTrackShape ;
        sh:sourceConstraintComponent sh:MinCountConstraintComponent ;
        sh:focusNode :McCartney ;
        sh:resultPath :track ;
        sh:resultMessage "There must be >= 1 values"
    ]
] .
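A report in this form can be checked from a script, for example to fail a build when the database does not conform. The helpers below are a minimal sketch that assumes the report is written to standard output as Turtle in the layout shown above; the function names report_conforms and count_violations are hypothetical, and the text matching would need adjusting for other serializations.

```shell
# Succeeds only if the Turtle report read from stdin states "sh:conforms true".
report_conforms() {
  grep -Eq 'sh:conforms[[:space:]]+true'
}

# Prints the number of sh:ValidationResult entries in the report read from stdin.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the status clean.
count_violations() {
  grep -c 'a sh:ValidationResult' || true
}

# Possible usage (needs a running Stardog server, so it is left as a comment):
#   stardog icv report testdb > report.ttl
#   report_conforms < report.ttl || echo "violations: $(count_violations < report.ttl)"
```

The count reflects only the violations present in the report, so it is subject to whatever report limits are in effect.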

Validating Specific Graphs

By default, validation works over the whole database and validates the contents of every named graph. Validation can be performed over one or more specific named graphs:

$ stardog icv report -g urn:example:graph1 urn:example:graph2 -- testdb

Note that the -- is required to separate the named graph arguments from the database name. The named graph IRIs can be specified as prefixed names using stored namespaces.

Validating Specific Nodes

The focus of validation can be made as fine-grained as one or more nodes in the graph:

$ stardog icv report --nodes urn:example:MyNode -- testdb

In this case, only the specified node(s) will be validated, using the applicable constraints. Any other node or constraint will be ignored.

Validating Shapes from Specific Graphs

If the constraints are loaded into multiple named graphs, an optional argument to the validation command can be used to validate only the constraints from some of those graphs:

$ stardog icv report --shape-graphs urn:example:constraintGraph -- testdb

Validating Specific Shapes

The shapes to validate can be directly specified for the validation command too:

$ stardog icv report --shapes :SongShape :AlbumShape -- testdb

Limiting Number of Violations

For some databases the number of violations may be too high, and it is desirable to limit the number included in the validation report for performance or readability reasons. Two parameters can be used for this purpose. The first limits the total number of violations returned in a report. By default, the CLI command uses a limit of 100, but a different limit can be specified as follows:

$ stardog icv report --limit 1000 testdb

Passing --limit -1 returns all the violations.

The second parameter limits the violations reported per shape. This is useful for getting a quick summary of all the shapes that have violations. By default, 10 violations per shape are returned, and the default global limit of 100 still applies. If --limit-per-shape is specified without --limit, the global limit is automatically set to -1. The following example returns at most one violation for each shape:

$ stardog icv report --limit-per-shape 1 testdb

ICV & Reasoning

An integrity constraint may be satisfied or violated in either of two ways: by an explicit statement in a Stardog database or by a statement that has been validly inferred by Stardog. For this reason, the validation results can change depending on whether reasoning is enabled. By default, reasoning is disabled for validation, but it can be enabled as with any other CLI command using the -r, --reasoning option:

$ stardog icv report --reasoning testdb

ICV Guard Mode

Stardog will also apply constraints as part of its transactional cycle and fail transactions that violate constraints. We call this “guard mode”. It must be enabled explicitly in the database configuration options. Using the command line, these steps are as follows:

  1. Take the database offline.

     $ stardog-admin db offline myDb
    
  2. Enable ICV with the icv.enabled database configuration option.

     $ stardog-admin metadata set -o icv.enabled=true myDb
    
  3. Bring the database back online.

     $ stardog-admin db online myDb
    

Once guard mode is enabled, modifications of the database (via SPARQL Update or any other method), whether adds or deletes, that violate the integrity constraints will cause the transaction to fail.
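The three steps above can be wrapped in a small helper so they always run in order. This is a sketch rather than an official script: the function name enable_guard_mode is hypothetical, and it uses only the stardog-admin commands listed in the steps.

```shell
# Enables ICV guard mode for the database named in $1 by running the three
# documented steps in order; the chain stops at the first step that fails.
enable_guard_mode() {
  db="$1"
  stardog-admin db offline "$db" &&
    stardog-admin metadata set -o icv.enabled=true "$db" &&
    stardog-admin db online "$db"
}

# Possible usage (needs a running Stardog server, so it is left as a comment):
#   enable_guard_mode myDb
```

Because the chain stops on the first failure, a failed metadata set would leave the database offline, so the exit status is worth checking before relying on the database being available again.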

SHACL Extensions in Stardog

This section discusses the SHACL extensions in Stardog that are not covered by the SHACL standard.

Query Dataset Specification in SPARQL constraints

SPARQL Constraints in SHACL are supported by Stardog. However, in standard SHACL the only way to define the dataset for constraint queries is to put FROM or FROM NAMED clauses directly in the query, which is not always convenient. To address this need, Stardog introduced a non-standard SHACL property as an extension in version 7.9, namely tag:stardog:api:shacl:fromNamed.

Let’s exemplify the concept: Assume that we need to have data in both staging and production graphs (called :stagingGraph and :productionGraph, respectively). The graphs should not have any matching data about the same node. Thus, we’d like to validate :stagingGraph against :productionGraph within the constraint query, in such a way that only nodes in :stagingGraph are to be validated (i.e. are target nodes for the shape) while the constraint query is executed against :productionGraph:

@prefix : <http://api.stardog.com/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix stardog: <http://api.stardog.com/> .
@prefix stardogSh: <tag:stardog:api:shacl:> .

# The shape using the SPARQL Constraint below
:DepartmentShape
  rdf:type sh:NodeShape ;
  sh:targetClass :Department ;
  sh:sparql :OneDirectorOnly-sparql
.

# The SPARQL Constraint with the custom extension

:OneDirectorOnly-sparql
  rdf:type sh:SPARQLConstraint ;
  sh:message "a department can only have one director." ;
  stardogSh:fromNamed :productionGraph ;
  sh:select """
            prefix : <http://api.stardog.com/>
            SELECT ?this
            WHERE {
                # this is evaluated against the staging graph
                ?this :director ?director .
                GRAPH ?g {
                    # this is evaluated against the production graph
                    ?this :director ?anotherDirector .
                }
                FILTER ( ?director != ?anotherDirector )
             }
            """ ;
.

Imagine trying to accomplish the task above, constraining data matches between two graphs, using only the existing --named-graphs option of icv report. Such a constraint would apply only to departments in :stagingGraph, since :stagingGraph would be passed to --named-graphs and ?this would be pre-bound to departments in that graph. However, the new stardogSh:fromNamed property now makes the constraint query (with pre-bound ?this) run against :productionGraph, and that query should not find a matching instance.

With the introduction of this extension, it is necessary to define the priority between the given --named-graphs, the FROM part of the query, and the FROM NAMED part of the query when constructing the query dataset for a SPARQL constraint:

  • If stardogSh:fromNamed is provided, it takes precedence over both FROM NAMED in the query and --named-graphs when defining the named part of the query dataset.
  • --named-graphs takes precedence over FROM NAMED graphs in the query.
  • stardogSh:fromNamed has no effect on the default part of the query dataset. That part is defined by --named-graphs (if provided) or FROM in the query. If neither is specified, it consists of all local graphs (including the default graph).

SHACL Support Limitations

Stardog supports all the features in the core SHACL Language with the following exceptions:

  1. Stardog supports SPARQL-based constraints but does not support prebinding the $shapesGraph or $currentShape variables in SPARQL.
  2. Stardog does not support property validators.
  3. Stardog does not support the Advanced Features or the JavaScript Extensions.

Chapter Contents