Obfuscating Data
This page discusses obfuscating datasets and queries in Stardog to prevent sharing sensitive data.
What is Data Obfuscation?
- Data obfuscation is a mechanism that exports data as an obscured, unintelligible dataset while keeping the overall structure of the data intact.
- Obfuscation is one-way: once the data is obfuscated, the original values cannot be read back from the obfuscated dataset.
Why use Data Obfuscation?
Data obfuscation makes it easy to produce a dataset and query that you can use to:
- submit bug reports using sensitive data
- share sensitive RDF data with others
How to Obfuscate Data and Queries
The data obfuscate command is used to obfuscate data. It is very similar to the data export command:
- In the above example, all named graphs are exported in the TriG format from the database (myDatabase) to a file (/tmp/obfuscated.trig).
- The exported data can be loaded into any Stardog database.
Once the data is obfuscated, queries written against the original data will no longer work against the obfuscated data. The query obfuscate command is used to obfuscate queries.
- The obfuscated query (obfuscated.sparql) can then be executed against the database with the obfuscated data loaded into it.
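To make concrete why the query has to be obfuscated along with the data, here is a minimal Python sketch. It is not Stardog's implementation: the `digest` and `matches` helpers, the `urn:` identifiers, and the use of a bare hex digest as the obfuscated value are all assumptions made for illustration.

```python
import hashlib


def digest(value: str) -> str:
    """Hypothetical one-way mapping: hex SHA-256 of the value's UTF-8 bytes."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# A tiny dataset of (subject, predicate, object) triples; all names are made up.
original = {("urn:alice", "urn:knows", "urn:bob")}

# Obfuscate every term of every triple with the same mapping.
obfuscated = {tuple(digest(term) for term in triple) for triple in original}


def matches(data, subject, predicate):
    """Return the objects of all triples with the given subject and predicate."""
    return [o for s, p, o in data if s == subject and p == predicate]


# Constants from the original query no longer match the obfuscated data...
stale = matches(obfuscated, "urn:alice", "urn:knows")

# ...but constants passed through the same mapping do.
fresh = matches(obfuscated, digest("urn:alice"), digest("urn:knows"))
```

Because the mapping is deterministic, equal values stay equal after obfuscation, which is what keeps the structure of the data intact and lets an obfuscated query find the same pattern.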
Example
The following example shows how to obfuscate a small dataset and query.
1. Database Setup
- Create a file sample.trig:
- Create a database (myDatabase) and load sample.trig into it:
2. Run the Query Against Sample Database
- Create a file (query.sparql) that contains the following SPARQL query:
- Execute the query:
3. Obfuscate the Data
obfuscated.trig
4. Obfuscate the Query
obfuscated.sparql
5. Executing the Obfuscated Query
- Create a new database (ObfuscatedDatabase) with the exported obfuscated data contained in obfuscated.trig:
- Execute the obfuscated query (obfuscated.sparql) against ObfuscatedDatabase:
Additional Configuration
By default, all URIs, bnodes, and string literals in the database are obfuscated using the SHA256 message digest algorithm. Non-string typed literals (numbers, dates, etc.) and URIs from built-in namespaces (e.g. RDF, RDFS, OWL) are left unchanged. It’s possible to customize obfuscation by providing a configuration file.
See an example obfuscation configuration file in the stardog-examples GitHub repository.
If a custom configuration file is used to obfuscate the data, then the same configuration should be used for obfuscating the queries as well.
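The default rules described above can be sketched in a few lines of Python. This is an illustration only, not Stardog's code: the `Term` class, the `obfuscate` and `sha256_hex` helpers, and the choice to store a bare hex digest as the new value are assumptions, and literals with no datatype are treated as strings here.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

# Assumed prefixes for the built-in namespaces named above (RDF, RDFS, OWL).
BUILT_IN = (
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "http://www.w3.org/2000/01/rdf-schema#",
    "http://www.w3.org/2002/07/owl#",
)
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


@dataclass(frozen=True)
class Term:
    """Hypothetical RDF term: kind is "uri", "bnode", or "literal"."""
    kind: str
    value: str
    datatype: Optional[str] = None  # only meaningful for literals


def sha256_hex(text: str) -> str:
    """Hex SHA-256 digest of the UTF-8 encoding of text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def obfuscate(term: Term) -> Term:
    """Apply the default rules: hash URIs, bnodes, and string literals only."""
    # URIs from built-in namespaces are left unchanged.
    if term.kind == "uri" and term.value.startswith(BUILT_IN):
        return term
    # Non-string typed literals (numbers, dates, ...) are left unchanged.
    if term.kind == "literal" and term.datatype not in (None, XSD_STRING):
        return term
    # Everything else is replaced by its digest.
    return Term(term.kind, sha256_hex(term.value), term.datatype)
```

For instance, `obfuscate(Term("literal", "42", "http://www.w3.org/2001/XMLSchema#integer"))` returns the term unchanged, while `obfuscate(Term("literal", "Alice"))` returns a literal whose value is a 64-character hex digest.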
Message Digest Algorithm
To change the message digest algorithm used to obfuscate the data (from the default of SHA256), include the following in your obfuscation configuration file:
What to Include/Exclude in Your Obfuscated Dataset
The configuration file specifies which URIs and strings will be obfuscated by defining inclusion and exclusion filters.
- Only the values that match the include pattern and do not match the exclude pattern will be obfuscated.
- Each pattern in a filter expression has a position identifier, which is one of [any, subject, predicate, object].
- The pattern will be applied to a value depending on the position of the value.
- For example, it is possible to write filter expressions such that the same URI will be obfuscated when it is used in the subject position but not in the object position.
- The pattern expression should be a valid Java regular expression.
- A filter expression may refer just to a namespace, in which case any URI belonging to that namespace is matched. The value of the namespace should be defined in the namespaces declaration.
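The include/exclude logic above can be sketched as follows. This is a hedged Python illustration, not Stardog's configuration format or code: the `Pattern` class and `should_obfuscate` function are hypothetical names, the sketch assumes a pattern must match the whole value, and it uses Python's `re` module, whose syntax differs from Java regular expressions in some details.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    """Hypothetical filter pattern: a position identifier plus a regex."""
    position: str  # one of "any", "subject", "predicate", "object"
    regex: str

    def matches(self, value: str, position: str) -> bool:
        # The pattern applies only if its position is "any" or equals the
        # position in which the value occurs (assumption: whole-value match).
        applies = self.position in ("any", position)
        return applies and re.fullmatch(self.regex, value) is not None


def should_obfuscate(value, position, include, exclude):
    """True if the value matches an include pattern and no exclude pattern."""
    included = any(p.matches(value, position) for p in include)
    excluded = any(p.matches(value, position) for p in exclude)
    return included and not excluded


# Obfuscate everything, except urn:example: URIs used in the object position.
include = [Pattern("any", r".*")]
exclude = [Pattern("object", r"urn:example:.*")]

as_subject = should_obfuscate("urn:example:alice", "subject", include, exclude)
as_object = should_obfuscate("urn:example:alice", "object", include, exclude)
```

With these filters the same URI is obfuscated when it appears as a subject (`as_subject` is `True`) but kept as-is when it appears as an object (`as_object` is `False`).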