Data Quality Constraints

This chapter discusses Stardog’s Integrity Constraint Validation (ICV), a feature that enforces data integrity and helps improve the knowledge graph’s correctness and consistency. This page provides an overview and shows the basic usage of this feature. See the Chapter Contents to view what else is included in this chapter.

Page Contents
  1. Overview
  2. CLI
  3. ICV & OWL 2 Reasoning
  4. ICV Guard Mode
  5. Explaining ICV Violations
  6. Security Note for ICV
  7. Repairing ICV Violations
  8. SHACL Constraints
  9. SHACL Support Limitations
  10. Constraints Formats
  11. Terminology
  12. Chapter Contents

Overview

Stardog Integrity Constraint Validation (“ICV”) validates RDF data stored in a Stardog database against constraints that users define to suit their domain, application, and data. These constraints may be written in SPARQL, OWL, SWRL, or SHACL.

The use of high-level languages (OWL 2, SWRL, and SPARQL) to validate RDF data using closed world semantics is one of Stardog’s unique capabilities. Using high-level languages like OWL, SWRL, and SPARQL as schema or constraint languages for RDF and Linked Data has several advantages:

  • Unifying the domain model with data quality rules
  • Aligning the domain model and data quality rules with the integration model and language (i.e., RDF)
  • Being able to query the domain model, data quality rules, integration model, mapping rules, etc. with SPARQL
  • Being able to use automated reasoning about all of these things to ensure logical consistency, explain errors and problems, etc.

See the extended ICV tutorial on GitHub and our blog post, Data Quality with ICV, for more details about using ICV. SHACL is demonstrated in our Data Validation and SHACL webinar.

CLI

The CLI icv commands can be used to add constraints to an existing database, remove specific constraints from it, or drop all of its constraints. They can also validate an existing database against constraints passed directly to the icv command, that is, constraints other than the ones already associated with the database.

For a full description of ICV CLI usage, run stardog help icv and stardog-admin help icv from the command line, or see the icv command group pages in the Stardog and Stardog Admin CLI Reference manuals.

To add constraints to a database:

$ stardog-admin icv add myDb constraints.rdf

To drop all constraints from a database:

$ stardog-admin icv drop myDb

To remove one or more specific constraints from a database:

$ stardog-admin icv remove myDb constraints.rdf

To convert new or existing constraints into SPARQL queries for export:

$ stardog icv convert myDb constraints.rdf

To explain a constraint violation:

$ stardog icv explain --contexts http://example.org/context1 http://example.org/context2 -- myDb

To export constraints:

$ stardog icv export myDb constraints.rdf

To validate a database (or some named graphs) with respect to constraints:

$ stardog icv validate --contexts http://example.org/context1 http://example.org/context2 -- myDb

ICV & OWL 2 Reasoning

An integrity constraint may be satisfied or violated in either of two ways: by an explicit statement in a Stardog database or by a statement that’s been validly inferred by Stardog.

When integrity constraints (ICs) are validated, the user must specify whether reasoning is used. ICV is therefore performed with three inputs:

  1. a Stardog database,
  2. a set of constraints, and
  3. a reasoning flag (which may be, of course, set to false for no reasoning).

This is the case because domain modelers, ontology developers, or integrity constraint authors must consider the interactions between explicit and inferred statements and how these are accounted for in integrity constraints.
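
The effect of the reasoning flag can be sketched with a toy in-memory model. This is only an illustration in plain Python, not Stardog's API: the function names, the :Manager class, and the one-rule rdfs:subClassOf inference are all assumptions made for the example.

```python
# Toy model of ICV's three inputs: data, a constraint, and a reasoning flag.
# NOT Stardog's API; every name below is hypothetical.

RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def infer_types(triples):
    """Return the triples plus rdf:type statements entailed by rdfs:subClassOf."""
    closed = set(triples)
    changed = True
    while changed:  # repeat until no new type statement can be derived
        changed = False
        for s, p, o in list(closed):
            if p != RDF_TYPE:
                continue
            for c, q, d in list(closed):
                if q == SUBCLASS_OF and c == o and (s, RDF_TYPE, d) not in closed:
                    closed.add((s, RDF_TYPE, d))
                    changed = True
    return closed


def employees_without_number(triples):
    """Constraint: every :Employee must have an :employeeNum."""
    employees = {s for s, p, o in triples if p == RDF_TYPE and o == ":Employee"}
    numbered = {s for s, p, o in triples if p == ":employeeNum"}
    return employees - numbered


def validate(triples, constraint, reasoning):
    """Return the sorted violations, over inferred triples if reasoning is on."""
    view = infer_types(triples) if reasoning else set(triples)
    return sorted(constraint(view))


data = {
    (":Manager", SUBCLASS_OF, ":Employee"),
    (":alice", RDF_TYPE, ":Manager"),  # no :employeeNum asserted
    (":bob", RDF_TYPE, ":Employee"),
    (":bob", ":employeeNum", "42"),
}

print(validate(data, employees_without_number, reasoning=False))  # []
print(validate(data, employees_without_number, reasoning=True))   # [':alice']
```

Without reasoning, :alice is only known to be a :Manager, so the constraint does not apply to her. With reasoning, she is inferred to be an :Employee and the missing number becomes a violation.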

ICV Guard Mode

Stardog can also apply constraints as part of its transactional cycle and fail transactions that violate them. We call this “guard mode”. It must be enabled explicitly in the database configuration options. From the command line, the steps are as follows:

  1. Take the database offline.

     $ stardog-admin db offline myDb
    
  2. Enable ICV with the icv.enabled database configuration option.

     $ stardog-admin metadata set -o icv.enabled=true myDb
    
  3. Bring the database back online.

     $ stardog-admin db online myDb
    

Once guard mode is enabled, modifications of the database (via SPARQL Update or any other method), whether adds or deletes, that violate the integrity constraints will cause the transaction to fail.

We show how to enable guard mode at database creation time programmatically in Java later in this section.
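
The transactional behavior of guard mode can be sketched as follows. This is a toy in-memory model in plain Python, not Stardog's implementation or API: GuardedStore, ConstraintViolation, and commit are hypothetical names, and the single constraint is the "every :Employee has an :employeeNum" example used elsewhere in this chapter.

```python
# Toy model of guard mode: each transaction is validated before it commits.
# NOT Stardog's API; class and method names are hypothetical.

class ConstraintViolation(Exception):
    """Raised when a transaction would leave the data invalid."""


def employees_without_number(triples):
    """Constraint: every :Employee must have an :employeeNum."""
    employees = {s for s, p, o in triples if p == "rdf:type" and o == ":Employee"}
    numbered = {s for s, p, o in triples if p == ":employeeNum"}
    return employees - numbered


class GuardedStore:
    def __init__(self, constraints):
        self.triples = set()
        self.constraints = constraints

    def commit(self, adds=(), removes=()):
        # Build the state the transaction would produce, then validate it.
        candidate = (self.triples - set(removes)) | set(adds)
        for constraint in self.constraints:
            violations = constraint(candidate)
            if violations:
                # Reject the whole transaction; the stored data is unchanged.
                raise ConstraintViolation(sorted(violations))
        self.triples = candidate


store = GuardedStore([employees_without_number])
store.commit(adds=[(":bob", "rdf:type", ":Employee"),
                   (":bob", ":employeeNum", "42")])

try:
    # A delete can violate a constraint just as an add can.
    store.commit(removes=[(":bob", ":employeeNum", "42")])
except ConstraintViolation as err:
    print("transaction rejected:", err)

print(len(store.triples))  # still 2: the failed transaction changed nothing
```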

Explaining ICV Violations

ICV violations can be explained using Stardog’s Proof Trees. The following command will explain the IC violations for constraints stored in the database:

$ stardog icv explain --reasoning myDb

The command can also limit the number of violations displayed, and it can explain violations of external constraints when the file containing them is passed as an additional argument:

$ stardog icv explain --reasoning --limit 2 myDb constraints.ttl

Security Note for ICV

There is a security implication in this design that may not be obvious. Changing the reasoning type associated with a database and its integrity constraint validation may have serious security implications for that database, so it may only be performed by a user whose role has sufficient privileges for that action.

Repairing ICV Violations

Stardog supports automatic repair of some kinds of integrity violations. This can be accomplished programmatically via the API, as well as via the CLI using the icv fix subcommand.

Repair plans are emitted as a sequence of SPARQL Update queries, which means they can be applied to any system that understands SPARQL Update. If you pass --execute, the repair plan will be applied immediately.

The icv fix command will repair violations of all constraints in the database; if you’d prefer to fix the violations for only some constraints, you can pass those constraints as an additional argument. Although a possible (but trivial) fix for any violation is to remove one or more constraints, icv fix does not suggest that kind of repair, even though it may be appropriate in some cases.

SHACL Constraints

As of version 6.1, Stardog supports validation of SHACL constraints. SHACL constraints can be managed like any other constraint Stardog supports and all the existing validation commands work with SHACL constraints.

Normally, constraints are stored in the system database and managed with the special commands icv add and icv remove. This is still possible with SHACL constraints, but if desired, SHACL constraints can instead be loaded into the database along with regular data using data add. Validation results will be the same in both cases.

SHACL support comes with a new validation command that outputs the SHACL validation report:

$ stardog icv report myDb

SHACL Support Limitations

Stardog supports all the features in the core SHACL Language with the following exceptions:

  1. Stardog supports SPARQL-based constraints but does not support pre-binding the $shapesGraph or $currentShape variables in SPARQL.
  2. Stardog does not support property validators.
  3. Stardog does not support the Advanced Features or the JavaScript Extensions.

Constraints Formats

In addition to OWL, ICV constraints can be expressed in SPARQL and Stardog Rules. In both cases, the constraints define queries and rules to find violations. These constraints can be added individually, or defined together in a file as shown below:

@prefix rule: <tag:stardog:api:rule:> .
@prefix icv: <tag:stardog:api:icv:> .

# Rule Constraint: every :Employee must have an :employeeNum
[] a rule:SPARQLRule ;
   rule:content """
     PREFIX : <http://example.org/>

     IF {
         ?x a :Employee .
     }
     THEN {
         ?x :employeeNum ?number .
     }
   """ .

# SPARQL Constraint: the query returns the violations, that is,
# every :Employee without an :employeeNum
[] a icv:Constraint ;
   icv:query """
     PREFIX : <http://example.org/>

     SELECT * {
         ?x a :Employee .

         FILTER NOT EXISTS {
             ?x :employeeNum ?number .
         }
     }
   """ .

Terminology

We define some common terminology used in this chapter.

ICV, Integrity Constraint Validation

The process of checking whether some Stardog database is valid with respect to some integrity constraints. The result of ICV is a boolean value (true if valid, false if invalid) and, optionally, an explanation of constraint violations.

Schema, TBox

A schema (or “terminology box”, a.k.a. TBox) is a set of statements that define the relationships between data elements, including property and class names, their relationships, etc. In practical terms, schema statements for a Stardog database are RDF Schema and OWL 2 terms, axioms, and definitions.

Data, ABox

All of the triples in a Stardog database that aren’t part of the schema are part of the data (or “assertional box”, a.k.a. ABox).

Integrity Constraint

A declarative expression of some rule or constraint which data must conform to in order to be valid. Integrity Constraints are typically domain and application specific. They can be expressed in OWL 2 (any legal syntax), SWRL rules, or (a restricted form of) SPARQL queries.

Constraints

Constraints that have been associated with a Stardog database and which are used to validate the data it contains. Each Stardog database may optionally have one and only one set of constraints associated with it.

Closed World Assumption, Closed World Reasoning

Stardog ICV assumes a closed world with respect to data and constraints: that is, it assumes that all relevant data is known to it and included in a database to be validated. It interprets the meaning of Integrity Constraints in light of this assumption; if a constraint says a value must be present, the absence of that value is interpreted as a constraint violation and, hence, as invalid data.
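
As a minimal sketch of this closed world reading, the following plain Python function returns the boolean result and the list of violations for one constraint. It is an illustration only, not Stardog's API: the function name and the sample data are assumptions.

```python
# Toy illustration of closed world validation; NOT Stardog's API.

def validate_closed_world(triples):
    """Return (is_valid, violations) for 'every :Employee has an :employeeNum'."""
    employees = {s for s, p, o in triples if p == "rdf:type" and o == ":Employee"}
    numbered = {s for s, p, o in triples if p == ":employeeNum"}
    # Closed world: a value that is absent from the data counts as missing,
    # not as "unknown but possibly stated elsewhere".
    violations = sorted(employees - numbered)
    return (not violations, violations)


data = {
    (":alice", "rdf:type", ":Employee"),  # no :employeeNum in the data
    (":bob", "rdf:type", ":Employee"),
    (":bob", ":employeeNum", "42"),
}

is_valid, violations = validate_closed_world(data)
print(is_valid, violations)  # False [':alice']
```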

Open World Assumption, Open World Reasoning

A legal OWL 2 inference may violate or satisfy an Integrity Constraint in Stardog. In other words, you get to have your cake (OWL as a constraint language) and eat it, too (OWL as modeling or inference language). This means that constraints are applied to a Stardog database with respect to an OWL 2 profile.

Monotonicity

OWL is a monotonic language: that means you can never add anything to a Stardog database that causes there to be fewer legal inferences. Or, put another way, the only way to decrease the number of legal inferences is to delete something.

Monotonicity interacts with ICV in the following ways:

  1. Adding data to or removing it from a Stardog database may make it invalid.
  2. Adding schema statements to or removing them from a Stardog database may make it invalid.
  3. Adding new constraints to a Stardog database may make it invalid.
  4. Deleting constraints from a Stardog database cannot make it invalid.
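
The four interactions above can be reproduced with a toy model. This is a sketch in plain Python, not Stardog's API: the one-step rdfs:subClassOf inference, the two constraints, and the :name property are assumptions chosen to keep the example small.

```python
# Toy model of how data, schema, and constraint changes affect validity.
# NOT Stardog's API; every name below is hypothetical.

RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def with_inferred_types(triples):
    """One-step rdfs:subClassOf inference (sufficient for this example)."""
    inferred = set(triples)
    for s, p, o in triples:
        if p == RDF_TYPE:
            for c, q, d in triples:
                if q == SUBCLASS_OF and c == o:
                    inferred.add((s, RDF_TYPE, d))
    return inferred


def employees_missing(prop):
    """Build the constraint 'every :Employee must have a value for prop'."""
    def constraint(triples):
        employees = {s for s, p, o in triples if p == RDF_TYPE and o == ":Employee"}
        having = {s for s, p, o in triples if p == prop}
        return employees - having
    return constraint


def is_valid(triples, constraints):
    view = with_inferred_types(triples)
    return all(not constraint(view) for constraint in constraints)


needs_number = employees_missing(":employeeNum")
needs_name = employees_missing(":name")

base = {
    (":bob", RDF_TYPE, ":Employee"),
    (":bob", ":employeeNum", "42"),
    (":carol", RDF_TYPE, ":Manager"),  # not (yet) known to be an :Employee
}

results = {
    "base": is_valid(base, [needs_number]),
    # 1. Adding or removing data may invalidate the database.
    "add data": is_valid(base | {(":alice", RDF_TYPE, ":Employee")}, [needs_number]),
    "remove data": is_valid(base - {(":bob", ":employeeNum", "42")}, [needs_number]),
    # 2. Adding a schema statement may invalidate it (via a new inference).
    "add schema": is_valid(base | {(":Manager", SUBCLASS_OF, ":Employee")}, [needs_number]),
    # 3. Adding a constraint may invalidate it.
    "add constraint": is_valid(base, [needs_number, needs_name]),
    # 4. Deleting a constraint cannot invalidate it.
    "delete constraint": is_valid(base, []),
}

for change, valid in results.items():
    print(f"{change}: {'valid' if valid else 'invalid'}")
```

Only the unchanged database and the one with a constraint deleted remain valid. Each of the other four changes introduces a violation.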

Chapter Contents