Link Search Menu Expand Document
Start for Free

Specific Data Source Considerations

This page covers special considerations to take into account when using certain types of data sources.

Page Contents
  1. AWS Athena Virtual Graph Considerations
  2. Google BigQuery Virtual Graph Considerations
  3. MongoDB Virtual Graph Considerations
    1. Setting Unique Keys Manually
  4. MySQL Virtual Graph Considerations
  5. Elasticsearch Virtual Graph Considerations
    1. Supported Field Types
    2. Virtual Graph Mappings for Elasticsearch
    3. Automatically Generating Mappings
  6. Apache Cassandra Virtual Graph Considerations
  7. Stardog/SPARQL Virtual Graph Considerations

AWS Athena Virtual Graph Considerations

To connect to an Athena database, first download the driver for JDBC version 4.2 here. Then follow the instructions here.

The Athena JDBC driver does not expose the list of accessible databases. To save having to qualify every table name with the database name in your mappings, provide the default database in your connection URL using the Schema parameter. For example:

jdbc.url=jdbc:awsathena://athena.us-west.amazonaws.com:443;S3OutputLocation=s3://mybucket/output;Schema=mydb

Athena does not support primary (or unique) keys. This can negatively impact Stardog’s ability to create optimal queries. Use the unique.key.sets virtual graph option to define unique columns manually.


Google BigQuery Virtual Graph Considerations

To connect to a BigQuery database, first download the driver for JDBC version 4.2 here. Then follow the instructions here. You’ll need the following jar files:

  • GoogleBigQueryJDBC42.jar
  • google-api-client-1.28.0.jar
  • google-auth-library-oauth2-http-0.13.0.jar
  • gax-1.42.0.jar
  • google-api-services-bigquery-v2-rev426-1.25.0.jar
  • avro-1.9.0.jar

The connection string requires a number of parameters. For example:

jdbc.url=jdbc:bigquery://https://bigquery.googleapis.com/bigquery/v2;ProjectId=myproject;DefaultDataset=mydataset;OAuthType=0;OAuthServiceAcctEmail=vgtestaccount@myacct.iam.gserviceaccount.com;OAuthPvtKeyPath=/path/creds/bq_pk.json;Timeout=60
#jdbc.username=
#jdbc.password=

Note the jdbc.username and jdbc.password properties are not set. An example /path/creds/bq_pk.json file looks like this:

{
  "type": "service_account",
  "project_id": "myproject",
  "private_key_id": "363287712aaedb3cd20ebdc11213d43a15760768",
  "private_key": "-----BEGIN PRIVATE KEY-----\nBRIngMEMybROWnpANTsb4THenEXTbaTTLe/or\Ill8mYlUNC5+Din2tH...3er45/2ed\53fg\g+d##\n-----END PRIVATE KEY-----\n",
  "client_email": "myaccount@myid.iam.gserviceaccount.com",
  "client_id": "12345678987654321239",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/myaccount%40myid.iam.gserviceaccount.com"
}

See this for details. Note that alternative means of providing credentials may require additional jar files.

BigQuery does not support primary (or unique) keys. This can negatively impact Stardog’s ability to create optimal queries. Use the unique.key.sets virtual graph option to define unique columns manually.

MongoDB Virtual Graph Considerations

To connect to a MongoDB database, first download the client jar. The client jar for MongoDB version “x.y.z” can be obtained from http://central.maven.org/maven2/org/mongodb/mongo-java-driver/x.y.z/mongo-java-driver-x.y.z.jar Then follow the instructions here.

MongoDB has one Date type. It is stored as a 64-bit integer that represents the number of milliseconds since Jan 1, 1970, Universal Time Coordinated (UTC). This Date type can be mapped to xsd:date, xsd:dateTime or xsd:dateTimeStamp data types. (The xsd:dateTimeStamp data type is the same as xsd:dateTime except instead of having an optional timezone the timezone is required.) When a Date is mapped to either xsd:date or xsd:dateTimeStamp, it will be represented in the UTC timezone. When a Date field is mapped to an xsd:dateTime, the Date will be converted to the local timezone of the Stardog server and the label will include the timezone.

Setting Unique Keys Manually

The unique.key.sets option can be used with MongoDB, but the format is MongoDB-specific. In place of schema and table names, the unique keys must be specified in terms of the collection and a list of nested arrays.

For example, take an accounts virtual graph with this SMS2 mapping:

prefix : <http://example.com/>

MAPPING <urn:accounts>
FROM JSON {
  "accounts" : {
    "_id" : "?id",
    "acct" : "?acctNum",
    "customerName" : [ "?name" ],
    "card" : [ {
      "number" : "?ccNumber",
      "expiration" : "?ccExpr" }
    ]
  }
}
TO {
  ?holder :hasAcct ?acct .
  ?holder :hasName ?name .
}
WHERE {
  BIND (template("http://example.com/acct/{acctNum}") AS ?acct)
  BIND (template("http://example.com/holder/{ccNumber}_{name}") AS ?holder)
}

And this query:

SELECT * {
  graph <virtual://accounts> {
    ?holder :hasAcct ?acct .
    ?holder :hasName ?name .
  }
}

When Stardog translates this query, it creates a flattened view of the collection (using the $unwind stage) giving a relational view. In this example both the customerName and card arrays will be flattened because both are referenced in the template for the ?holder variable.

The plan for the example query will include a join because Stardog has no way of knowing that the card.number/customerName pair is unique. If we know that this pair of fields is indeed unique in this collection, we can make the query more efficient by adding the pair as a unique key to the unique.key.sets property:

unique.key.sets=(accounts.[card;customerName].customerName,accounts.[card;customerName].card.number)

It is required that the flattened arrays are listed in alphabetical order, are separated by semicolons, and are enclosed in square brackets. For nested arrays, use periods to delimit the names (level1.level2).

Multiple key sets can be separated with commas. For example, if we also know that the acct field is unique, the property value becomes:

unique.key.sets=(accounts.[].acct),(accounts.[card;customerName].customerName,accounts.[card;customerName].card.number)

Read more about Native MongoDB Support


MySQL Virtual Graph Considerations

As of Java 1.8 update 291 TLS 1.0 and 1.1 have been disabled. Using the MySQL connector with this Java version or later requires setting the parameter enabledTLSProtocols=TLSv1.2 on your JDBC connection string. For example:

jdbc.url=jdbc:mysql://localhost/dept?enabledTLSProtocols=TLSv1.2

Alternatively, if you do not wish to use SSL to encrypt traffic between Stardog and MySQL you can set useSSL=false:

jdbc.url=jdbc:mysql://localhost/dept?useSSL=false

See the Java 8u291 release notes for more details.


Elasticsearch Virtual Graph Considerations

To create an Elasticsearch virtual graph, you need to download the Elasticsearch client jar along with two supporting jars. The client jar for Elasticsearch version “x.y.z” can be obtained from https://repo1.maven.org/maven2/org/elasticsearch/client/elasticsearch-rest-client/x.y.z/elasticsearch-rest-client-x.y.z.jar

Two supporting jars are also required:

  1. http async client
  2. http core nio

Supported Field Types

Stardog supports the following Elasticsearch field types:

  • keyword
  • long
  • integer
  • short
  • byte
  • double
  • float
  • half_float
  • scaled_float
  • date

Note that only the keyword data type is supported for strings. Strings indexed as text cannot be mapped for the purpose of SPARQL/GraphQL querying.

Virtual Graph Mappings for Elasticsearch

To create virtual graph mappings for Elasticsearch, use SMS2 mapping syntax with the FROM JSON clause.

There are two types of mappings being discussed here: Stardog Virtual Graph mappings that describe how Elasticsearch fields are mapped to RDF and Elasticsearch mappings that define a schema for an Elasticsearch index.

For an index named simple with a single Elasticsearch mapping type named _doc and an Elasticsearch mapping like:

{
  "simple" : {
    "mappings" : {
      "_doc" : {
        "properties" : {
          "StudentId" : {
            "type" : "integer"
          },
          "StudentName" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          }
        }
      }
    }
  }
}

an example Stardog mapping could look like this:

prefix ex: <http://example.com/>

MAPPING <urn:example>
FROM JSON {
  "simple._doc":{
    "_id": "?docId",
    "StudentId": "?id",
    "StudentName.keyword": "?name"
  }
}
TO {
  ?studentIri a ex:Student ;
    rdfs:label ?name ;
    ex:esId ?docId .
}
WHERE {
  bind(template("http://example.com/{id}") as ?studentIri)
}

The top-level key in the FROM JSON (simple._doc) is formed by joining the index name and the Elasticsearch mapping type with a period. This is similar to the schemaName.tableName convention that is used for SQL databases. As a shorthand, for indexes with only one mapping type, the mapping type can be omitted. In this example, simple._doc can be replaced with simple assuming _doc is the only mapping type.

For Elasticsearch versions 6 and later, indexes are allowed only one mapping type, where the name of the mapping type defaults to _doc. For version 5 it is possible for an index to have more than one mapping type. See Removal of mapping types in the Elasticsearch online documentation for details.

Notice in the above example that the built-in _id field is mapped. Stardog knows that the _id field is unique across all documents in the index and it uses this information to simplify the queries it generates. Stardog is not able to determine the uniqueness of any other fields but if you know certain fields (or combinations of fields) are unique you can indicate which field sets are unique in the configuration options.

For example, suppose we know that StudentId is in fact a unique field. We can tell Stardog so by setting the unique.key.sets configuration option:

unique.key.sets=(simple._doc.StudentId)

or if the simple index has only the one mapping type:

unique.key.sets=(simple.StudentId)

Automatically Generating Mappings

Elasticsearch indexes have well-defined schemas. Stardog can use that schema information to automatically generate virtual graph mappings to RDF. By default, the generated templates for the IRIs will be based on the Elasticsearch mapping type names, which are _doc for all indexes on recent versions of Elasticsearch. This makes the IRIs difficult to distinguish. To address this, Stardog defaults the schema.in.generated.mappings to true when generating virtual graph mappings for Elasticsearch.


Apache Cassandra Virtual Graph Considerations

To create a Cassandra virtual graph, you’ll need to first download the shaded Cassandra client jar. The client jar for Cassandra version “x.y.z” can be obtained from http://central.maven.org/maven2/com/datastax/cassandra/cassandra-driver-core/x.y.z/cassandra-driver-core-x.y.z-shaded.jar

Cassandra is special in the way it attempts to prevent users from distributing queries over a large number of server nodes. If you have experience with CQL queries, you have no doubt seen the ubiquitous error message:

Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING

This reflects the Cassandra modeling principle that favors writing the same data to multiple tables (perhaps through the use of Materialized Views), where each table is optimized for answering different queries.

In order to support as many queries as possible, we recommend creating mappings to each of these tables and letting Stardog choose which mappings apply for each query. It is possible that no mappings can support a particular query. In such cases, Stardog will write an entry to the log file and return no results.

This is the default behavior, which can be changed by setting the cassandra.allow.filtering option to true. When set, Stardog will include the ALLOW FILTERING clause at the end of each CQL query. Please note that the use of this option is highly discouraged in large-scale production environments.

Cassandra is also special for how SQL-like its query language is (for a NoSQL database). As this is the case, Stardog supports the use of SQL queries in the mappings files for Cassandra virtual graphs. That is, you can use the rr:sqlQuery predicate for R2RML mappings, the sm:query predicate for Stardog Mapping Syntax, or the FROM SQL clause for Stardog Mapping Syntax 2. In all cases, you can supply a SQL query to describe a view to use for a virtual graph mapping, however, the SQL query can only contain operators that are supported in CQL - no joins, subqueries, SQL functions, etc. are allowed.


Stardog/SPARQL Virtual Graph Considerations

To create Stardog Engine or SPARQL service virtual graph, the properties file should specify values for the following properties:

Property Description
sparql.url SPARQL endpoint with database specified, for example, http://myhost:26023/testdb/query
sparql.username username to access the SPARQL endpoint
sparql.password password to access the SPARQL endpoint
sparql.graphname default graph to run the queries against
sparql.statsbasedoptimization boolean value to enable/disable statistics based optimization, only valid for Virtual Graphs to other Stardog servers

Virtual graph mappings are not supported.