Obfuscating Data
This page discusses obfuscating datasets and queries in Stardog to prevent sharing sensitive data.
What is Data Obfuscation?
- Data obfuscation is a mechanism that exports data as an obscured, unintelligible dataset while keeping the overall structure of the data intact.
- Data obfuscation is one-way. Values are replaced with message digests (SHA256 by default), so the original data cannot be read back from the obfuscated dataset.
Why use Data Obfuscation?
Data obfuscation makes it easy to produce a dataset and query that can be used to:
- submit bug reports using sensitive data
- share sensitive RDF data with others
How to Obfuscate Data and Queries
The data obfuscate command is used to obfuscate data. It is very similar to the data export command:
stardog data obfuscate -f TRIG -g ALL myDatabase /tmp/obfuscated.trig
- In the above example, all named graphs are exported in the TriG format from the database (myDatabase) to a file (/tmp/obfuscated.trig).
- The exported data can be loaded into any Stardog database.
Once the data is obfuscated, queries written against the original data will no longer work against the obfuscated data. The query obfuscate command is used to obfuscate queries.
# redirects obfuscated query to obfuscated.sparql
stardog query obfuscate myDatabase query.sparql > obfuscated.sparql
- The obfuscated query (obfuscated.sparql) can then be executed against the database with the obfuscated data loaded into it.
Example
The following example shows how to obfuscate a small dataset and query.
1. Database Setup
- Create a file sample.trig:
<http://example.com/graph> {
    <http://example.com/test/1> <http://example.com/attribute/name> "t1" .
    <http://example.com/test/1> <http://example.com/attribute/id> "0001" .
    <http://example.com/test/2> <http://example.com/attribute/name> "t1" .
    <http://example.com/test/2> <http://example.com/attribute/id> "0002" .
}
- Create a database (myDatabase) and load sample.trig into it:
stardog-admin db create -n myDatabase sample.trig
2. Run the Query Against Sample Database
- Create a file (query.sparql) that contains the following SPARQL query:
SELECT ?name
FROM <http://example.com/graph>
WHERE {
    <http://example.com/test/1> <http://example.com/attribute/name> ?name
}
- Execute the query:
stardog query execute myDatabase query.sparql
+-------+
| name  |
+-------+
| "t1"  |
+-------+
3. Obfuscate the Data
stardog data obfuscate -f TRIG -g ALL myDatabase obfuscated.trig
obfuscated.trig
@prefix obf: <tag:stardog:api:obf:> .
obf:19b37d9ffd391cd0e29ad7a0c92722e1190ab546213370807320d5e351d10b79 {
obf:d27330fb53d3f2a6d5068ce46d248392ec09f93692a8e36a7bd33a3c128dafd4
obf:64b72b77e8949f32a09d38590b15e7a757a1db3bc0a186405cc4e83141b54e2a "628b49d96dcde97a430dd4f597705899e09a968f793491e4b704cae33a40dc02" ;
obf:396dc63bd5eb69cb7a1283567c9eadd91d1319555faac5e6644314b0cf0c150d "888b19a43b151683c87895f6211d9f8640f97bdc8ef32f03dbe057c8f5e56d32" .
obf:ac6d2bb543a71a5ec4ac0a591b8c0aafa9ad54adf0d1d686f00fabc6799ddcf9
obf:64b72b77e8949f32a09d38590b15e7a757a1db3bc0a186405cc4e83141b54e2a "628b49d96dcde97a430dd4f597705899e09a968f793491e4b704cae33a40dc02" ;
obf:396dc63bd5eb69cb7a1283567c9eadd91d1319555faac5e6644314b0cf0c150d "4fac6dbe26e823ed6edf999c63fab3507119cf3cbfb56036511aa62e258c35b4" .
}
4. Obfuscate the Query
stardog query obfuscate myDatabase query.sparql > obfuscated.sparql
obfuscated.sparql
SELECT ?x0
FROM <tag:stardog:api:obf:19b37d9ffd391cd0e29ad7a0c92722e1190ab546213370807320d5e351d10b79>
WHERE {
<tag:stardog:api:obf:d27330fb53d3f2a6d5068ce46d248392ec09f93692a8e36a7bd33a3c128dafd4>
<tag:stardog:api:obf:64b72b77e8949f32a09d38590b15e7a757a1db3bc0a186405cc4e83141b54e2a> ?x0 .
}
5. Execute the Obfuscated Query
- Create a new database (ObfuscatedDatabase) with the exported obfuscated data contained in obfuscated.trig:
stardog-admin db create -n ObfuscatedDatabase obfuscated.trig
- Execute the obfuscated query (obfuscated.sparql) against ObfuscatedDatabase:
stardog query execute ObfuscatedDatabase obfuscated.sparql
+--------------------------------------------------------------------+
|                                 x0                                 |
+--------------------------------------------------------------------+
| "628b49d96dcde97a430dd4f597705899e09a968f793491e4b704cae33a40dc02" |
+--------------------------------------------------------------------+
Additional Configuration
By default, all URIs, bnodes, and string literals in the database are obfuscated using the SHA256 message digest algorithm. Non-string typed literals (numbers, dates, etc.) and URIs from built-in namespaces (e.g., RDF, RDFS, OWL) are left unchanged. Obfuscation can be customized by providing a configuration file:
stardog data obfuscate --config obfuscation.ttl myDatabase obfDatabase.ttl
See an example obfuscation configuration file in the stardog-examples GitHub repository.
If a custom configuration file is used to obfuscate the data, then the same configuration should be used for obfuscating the queries as well.
stardog query obfuscate --config obfuscation.ttl myDatabase myQuery.sparql > obfQuery.sparql
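Because the data and the queries must be obfuscated with the same configuration, it can help to generate both command lines from a single place. The following is a minimal sketch under that assumption, not part of Stardog itself: the helper name `obfuscation_commands` and its parameters are hypothetical, and it only assembles the two commands shown above without running them.

```python
import shlex


def obfuscation_commands(database, query_file, data_out, query_out, config=None):
    """Build the data and query obfuscation command lines with one shared config.

    Hypothetical helper: it only formats the commands documented on this page.
    """
    # The same --config value is reused for both commands so that data and
    # queries are obfuscated consistently.
    config_args = ["--config", config] if config else []
    data_cmd = ["stardog", "data", "obfuscate", *config_args, database, data_out]
    query_cmd = ["stardog", "query", "obfuscate", *config_args, database, query_file]
    # The query command writes to standard output, so it is redirected to a file.
    data_line = shlex.join(data_cmd)
    query_line = shlex.join(query_cmd) + " > " + shlex.quote(query_out)
    return data_line, query_line


data_line, query_line = obfuscation_commands(
    "myDatabase",
    "myQuery.sparql",
    "obfDatabase.ttl",
    "obfQuery.sparql",
    config="obfuscation.ttl",
)
print(data_line)
print(query_line)
```

The printed lines can be pasted into a shell or a script, which keeps the configuration file name from drifting between the two commands.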
Message Digest Algorithm
To change the message digest algorithm used to obfuscate the data (from a default of SHA256), include the following in your obfuscation configuration file:
# Obfuscation namespace is used only for parsing the config file
@prefix obf: <tag:stardog:api:obf:> .
[] a obf:Obfuscation ;
    # Message digest algorithm that will be used to obfuscate terms
    # Should be a message digest algorithm supported by Java
    obf:digest "MD5" .
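To illustrate what switching the digest algorithm changes, here is a rough Python sketch of digest-based obfuscation. It assumes the term's UTF-8 text is hashed directly. The exact bytes Stardog feeds to the digest are not described on this page, so the values produced here should not be expected to match Stardog's output. The function name `digest_term` is hypothetical.

```python
import hashlib


def digest_term(term: str, algorithm: str = "sha256") -> str:
    """Return the hex digest of a term's UTF-8 bytes for the named algorithm."""
    # Assumption: the raw text of the term is hashed; Stardog's input may differ.
    return hashlib.new(algorithm, term.encode("utf-8")).hexdigest()


sha = digest_term("t1")
md5 = digest_term("t1", "md5")

# SHA256 yields 64 hex characters, MD5 yields 32.
print(len(sha), len(md5))

# The same input always produces the same digest, so repeated values stay
# linked after obfuscation while distinct values stay distinct.
print(digest_term("t1") == sha, digest_term("t2") == sha)
```

Determinism is what preserves the structure of the data: in the example above, both resources named "t1" end up with the same obfuscated literal.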
What to Include/Exclude in Your Obfuscated Dataset
The configuration file specifies which URIs and strings will be obfuscated by defining inclusion and exclusion filters:
- Only the values that match the include pattern and do not match the exclude pattern will be obfuscated.
- Each pattern in a filter expression has a position identifier, which is one of [any, subject, predicate, object].
  - The pattern will be applied to a value depending on the position of the value.
  - For example, it is possible to write filter expressions such that the same URI will be obfuscated when it is used in the subject position but not in the object position.
- The pattern expression should be a valid Java regular expression.
- A filter expression may instead refer to a namespace, in which case any URI belonging to that namespace will be matched. The namespace should be defined in the namespace declarations.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
# Obfuscation namespace is used only for parsing the config file
@prefix obf: <tag:stardog:api:obf:> .
[] a obf:Obfuscation ;
    obf:include [
        obf:position obf:any ;
        obf:pattern "math"  # default is .*, which includes everything
    ] ;
    obf:exclude [
        obf:position obf:any ;
        obf:namespace "rdf"
    ] .
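The include/exclude rule described above can be sketched in Python as follows. This is an illustration of the stated logic only, not Stardog's implementation. The `Filter` class and `should_obfuscate` function are hypothetical names, the sketch assumes a pattern may match anywhere in the value, it models a namespace filter as a prefix match on the namespace URI, and it uses Python's regular expression syntax, which differs from Java's in some details.

```python
import re
from dataclasses import dataclass


@dataclass
class Filter:
    position: str  # "any", "subject", "predicate", or "object"
    pattern: str   # regular expression applied to the value

    def matches(self, value: str, position: str) -> bool:
        # A filter applies only to its own position, unless it is "any".
        if self.position not in ("any", position):
            return False
        # Assumption: the pattern may match anywhere in the value.
        return re.search(self.pattern, value) is not None


def should_obfuscate(value, position, include, exclude):
    """A value is obfuscated if it matches an include filter and no exclude filter."""
    included = any(f.matches(value, position) for f in include)
    excluded = any(f.matches(value, position) for f in exclude)
    return included and not excluded


RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
include = [Filter("any", "math")]
# Assumption: a namespace filter behaves like a prefix match on the namespace URI.
exclude = [Filter("any", "^" + re.escape(RDF_NS))]

print(should_obfuscate("http://example.com/math/algebra", "subject", include, exclude))
print(should_obfuscate("http://example.com/physics", "object", include, exclude))
```

With these filters, the first URI is obfuscated because it matches the include pattern, while the second is left unchanged because it does not.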