Obfuscating Data

This page discusses obfuscating datasets and queries in Stardog to prevent sharing sensitive data.

Page Contents
  1. What is Data Obfuscation?
  2. Why use Data Obfuscation?
  3. How to Obfuscate Data and Queries
  4. Example
    1. Database Setup
    2. Run the Query Against Sample Database
    3. Obfuscate the Data
    4. Obfuscate the Query
    5. Executing the Obfuscated Query
  5. Additional Configuration
    1. Message Digest Algorithm
    2. What to Include/Exclude in Your Obfuscated Dataset

What is Data Obfuscation?

  • Data obfuscation is a mechanism that exports data as an unintelligible dataset while keeping the overall structure of the data intact.
  • Data obfuscation is one-way. Terms are replaced with message digests (SHA256 by default), so the original values cannot be read back out of the obfuscated dataset.
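
The structure survives because hashing is deterministic: the same term always maps to the same digest, so repeated values and joins still line up. The Python sketch below illustrates only that idea. The `obfuscate_term` helper is hypothetical, and the exact bytes Stardog feeds to the digest are an assumption here, so these digests are not expected to match Stardog's output.

```python
import hashlib

def obfuscate_term(term: str, algorithm: str = "sha256") -> str:
    """Hypothetical helper: replace a term with the hex digest of its UTF-8 bytes."""
    return hashlib.new(algorithm, term.encode("utf-8")).hexdigest()

# Two triples that share a predicate and a literal value.
triples = [
    ("http://example.com/test/1", "http://example.com/attribute/name", "t1"),
    ("http://example.com/test/2", "http://example.com/attribute/name", "t1"),
]

# Obfuscate every position of every triple.
obfuscated = [tuple(obfuscate_term(t) for t in triple) for triple in triples]

for s, p, o in obfuscated:
    print(s[:12], p[:12], o[:12])
```

The two printed rows differ in the subject column but agree in the predicate and object columns, which is the sense in which the shape of the data is preserved.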

Why use Data Obfuscation?

Data obfuscation makes it easy to produce a dataset and query that you can use to:

  • submit bug reports using sensitive data
  • share sensitive RDF data with others

How to Obfuscate Data and Queries

The data obfuscate command is used to obfuscate data. It is very similar to the data export command:

stardog data obfuscate -f TRIG -g ALL myDatabase /tmp/obfuscated.trig
  • In the above example, all named graphs are exported in the TriG format from the database (myDatabase) to a file (/tmp/obfuscated.trig).
  • The exported data can be loaded into any Stardog database.

Once the data is obfuscated, queries written against the original data will no longer work against the obfuscated data. The query obfuscate command is used to obfuscate queries.

# redirects obfuscated query to obfuscated.sparql
stardog query obfuscate myDatabase query.sparql > obfuscated.sparql
  • The obfuscated query (obfuscated.sparql) can then be executed against the database with the obfuscated data loaded into it.

Example

The following example shows how to obfuscate a small dataset and query.

1. Database Setup

  1. Create a file sample.trig:

     <http://example.com/graph> {
         <http://example.com/test/1> <http://example.com/attribute/name> "t1" .
         <http://example.com/test/1> <http://example.com/attribute/id> "0001" .
         <http://example.com/test/2> <http://example.com/attribute/name> "t1" .
         <http://example.com/test/2> <http://example.com/attribute/id> "0002" .
     }
    
  2. Create a database (myDatabase) and load sample.trig into it:

     stardog-admin db create -n myDatabase sample.trig
    

2. Run the Query Against Sample Database

  1. Create a file (query.sparql) that contains the following SPARQL query:

     SELECT ?name
     FROM <http://example.com/graph>
     WHERE {
         <http://example.com/test/1>  <http://example.com/attribute/name> ?name
     }
    
  2. Execute the query:

     stardog query execute myDatabase query.sparql
     +-------+
     | name  |
     +-------+
     | "t1"  |
     +-------+
    

3. Obfuscate the Data

stardog data obfuscate -f TRIG -g ALL myDatabase obfuscated.trig

obfuscated.trig

@prefix obf: <tag:stardog:api:obf:> .

obf:19b37d9ffd391cd0e29ad7a0c92722e1190ab546213370807320d5e351d10b79 {
    obf:d27330fb53d3f2a6d5068ce46d248392ec09f93692a8e36a7bd33a3c128dafd4  
        obf:64b72b77e8949f32a09d38590b15e7a757a1db3bc0a186405cc4e83141b54e2a "628b49d96dcde97a430dd4f597705899e09a968f793491e4b704cae33a40dc02" ;
        obf:396dc63bd5eb69cb7a1283567c9eadd91d1319555faac5e6644314b0cf0c150d "888b19a43b151683c87895f6211d9f8640f97bdc8ef32f03dbe057c8f5e56d32" .
    obf:ac6d2bb543a71a5ec4ac0a591b8c0aafa9ad54adf0d1d686f00fabc6799ddcf9 
        obf:64b72b77e8949f32a09d38590b15e7a757a1db3bc0a186405cc4e83141b54e2a "628b49d96dcde97a430dd4f597705899e09a968f793491e4b704cae33a40dc02" ;
        obf:396dc63bd5eb69cb7a1283567c9eadd91d1319555faac5e6644314b0cf0c150d "4fac6dbe26e823ed6edf999c63fab3507119cf3cbfb56036511aa62e258c35b4" .
}

4. Obfuscate the Query

stardog query obfuscate myDatabase query.sparql > obfuscated.sparql

obfuscated.sparql

SELECT ?x0
FROM <tag:stardog:api:obf:19b37d9ffd391cd0e29ad7a0c92722e1190ab546213370807320d5e351d10b79>
WHERE {
   <tag:stardog:api:obf:d27330fb53d3f2a6d5068ce46d248392ec09f93692a8e36a7bd33a3c128dafd4>   
      <tag:stardog:api:obf:64b72b77e8949f32a09d38590b15e7a757a1db3bc0a186405cc4e83141b54e2a> ?x0 .
}

5. Executing the Obfuscated Query

  1. Create a new database (ObfuscatedDatabase) with exported obfuscated data contained in obfuscated.trig:

     stardog-admin db create -n ObfuscatedDatabase obfuscated.trig
    
  2. Execute the obfuscated query (obfuscated.sparql) against ObfuscatedDatabase:

     stardog query execute ObfuscatedDatabase obfuscated.sparql
     +--------------------------------------------------------------------+
     |                                 x0                                 |
     +--------------------------------------------------------------------+
     | "628b49d96dcde97a430dd4f597705899e09a968f793491e4b704cae33a40dc02" |
     +--------------------------------------------------------------------+
    

Additional Configuration

By default, all URIs, bnodes, and string literals in the database are obfuscated using the SHA256 message digest algorithm. Non-string typed literals (numbers, dates, etc.) and URIs from built-in namespaces (e.g. RDF, RDFS, OWL) are left unchanged. You can customize obfuscation by providing a configuration file.

stardog data obfuscate --config obfuscation.ttl myDatabase obfDatabase.ttl

See an example obfuscation configuration file in the stardog-examples GitHub repository.

If a custom configuration file is used to obfuscate the data, then the same configuration should be used for obfuscating the queries as well.

stardog query obfuscate --config obfuscation.ttl myDatabase myQuery.sparql > obfQuery.sparql
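
One way to keep the two invocations from drifting apart is to build both from a single configuration value. The Python sketch below only assembles the argument lists shown above and does not run Stardog. The function name `obfuscation_commands` and its parameters are hypothetical.

```python
from typing import List, Optional

def obfuscation_commands(database: str, query_file: str, data_out: str,
                         config: Optional[str] = None) -> List[List[str]]:
    """Build the data and query obfuscate commands so both share one --config value."""
    shared = ["--config", config] if config else []
    data_cmd = ["stardog", "data", "obfuscate", *shared, database, data_out]
    # The query command writes the obfuscated query to standard output.
    query_cmd = ["stardog", "query", "obfuscate", *shared, database, query_file]
    return [data_cmd, query_cmd]

data_cmd, query_cmd = obfuscation_commands(
    "myDatabase", "myQuery.sparql", "obfDatabase.ttl", config="obfuscation.ttl"
)
print(" ".join(data_cmd))
print(" ".join(query_cmd))
```

To execute the commands, each list can be passed to `subprocess.run(..., check=True)`. For the query command, capture standard output and write it to the obfuscated query file.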

Message Digest Algorithm

To change the message digest algorithm used to obfuscate the data (from a default of SHA256), include the following in your obfuscation configuration file:

# Obfuscation namespace is used only for parsing the config file
@prefix obf: <tag:stardog:api:obf:> .

[] a obf:Obfuscation ;

    # Message digest algorithm that will be used to obfuscate terms
    # Should be a message digest algorithm supported by Java
    obf:digest "MD5" .
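
One visible effect of the digest choice is the length of the obfuscated identifiers. The Python sketch below compares a few algorithms using `hashlib`. The mapping from Java-style algorithm names to `hashlib` names is illustrative, and hashing the UTF-8 bytes of the term is an assumption, not a statement of what Stardog does internally.

```python
import hashlib

# Java-style algorithm names mapped to hashlib names (illustrative mapping).
JAVA_TO_HASHLIB = {"MD5": "md5", "SHA-1": "sha1", "SHA-256": "sha256"}

def digest_hex(term: str, java_name: str) -> str:
    """Hex digest of the term's UTF-8 bytes under the named algorithm."""
    return hashlib.new(JAVA_TO_HASHLIB[java_name], term.encode("utf-8")).hexdigest()

for name in JAVA_TO_HASHLIB:
    # Print the algorithm name and the number of hex characters it produces.
    print(name, len(digest_hex("http://example.com/test/1", name)))
```

MD5 yields 32 hex characters, SHA-1 yields 40, and SHA-256 yields 64, so a shorter digest gives more compact obfuscated output.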

What to Include/Exclude in Your Obfuscated Dataset

The configuration file specifies which URIs and strings will be obfuscated by defining inclusion and exclusion filters.

  • Only the values that match the include pattern and do not match the exclude pattern will be obfuscated.
  • Each pattern in a filter expression has a position identifier, which is one of [any, subject, predicate, object].
    • The pattern is applied to a value depending on the position of the value.
    • For example, it is possible to write filter expressions such that the same URI will be obfuscated when it is used in the subject position but not in the object position.
    • The pattern expression should be a valid Java regular expression.
    • A filter expression may instead refer to a namespace, in which case any URI belonging to that namespace is matched. The value of the namespace should be defined in the namespace declarations.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

# Obfuscation namespace is used only for parsing the config file
@prefix obf: <tag:stardog:api:obf:> .

[] a obf:Obfuscation ;

    obf:include [
        obf:position obf:any ;
        obf:pattern "math"    # default is .*, which includes everything
    ] ;

    obf:exclude [
        obf:position obf:any ;
        obf:namespace "rdf"
    ] .
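
The include/exclude rule described above can be sketched in Python as follows. This is an approximation under stated assumptions: the `Filter` class and `should_obfuscate` function are hypothetical, Python's `re` stands in for Java regular expressions, the pattern is assumed to match anywhere in the value, and a namespace filter is assumed to be a simple prefix test.

```python
import re
from dataclasses import dataclass
from typing import Optional

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

@dataclass
class Filter:
    position: str = "any"            # one of: any, subject, predicate, object
    pattern: str = ".*"              # default pattern matches everything
    namespace: Optional[str] = None  # namespace IRI, used instead of a pattern

    def matches(self, value: str, position: str) -> bool:
        # A filter applies only to its own position, unless it is "any".
        if self.position not in ("any", position):
            return False
        if self.namespace is not None:
            return value.startswith(self.namespace)  # assumed prefix match
        return re.search(self.pattern, value) is not None  # assumed unanchored

def should_obfuscate(value: str, position: str,
                     include: Filter, exclude: Filter) -> bool:
    """Obfuscate only values that match include and do not match exclude."""
    return include.matches(value, position) and not exclude.matches(value, position)

# Filters corresponding to the configuration above.
include = Filter(position="any", pattern="math")
exclude = Filter(position="any", namespace=RDF_NS)

print(should_obfuscate("http://example.com/math/algebra", "subject", include, exclude))
print(should_obfuscate(RDF_NS + "type", "predicate", Filter(), exclude))
```

Under these assumptions the first call returns `True` and the second returns `False`, because `rdf:type` falls in the excluded namespace even when the include filter matches everything.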