Machine Learning

This page discusses Stardog’s machine learning capabilities.

Page Contents

Overview
Predictive Analytics
Learning a Model
Making Predictions
Query Syntax Restrictions
Assessing Model Quality
Modeling Data
1. Data Representation
2. Data Types
Mastering the Machine

Overview

In this section, you’ll learn how to use Stardog’s machine learning capabilities for the general problem of predictive analytics. We’ll show you how to build a machine learning model and use it for prediction, plus best practices on modeling your data and improving the quality of results.

Check out our machine learning tutorial and an introductory blog article.

Predictive Analytics

Suppose you have data about movies, but that data is incomplete: some movies are missing the genre field. Filling in the missing data by hand is time-consuming, and you would like to do it automatically using all the information you already have about the movies. This is where Stardog’s predictive analytics comes into play. Using the data you have about movies with a genre, you can create a machine learning model that will predict the genre for the movies that are missing it. Isn’t that sweet?

Supervised learning is the basis of this capability. You give Stardog some data about the domain you’re interested in, and it will learn a model that can be used to make predictions about properties of that data.

Learning a Model

The first step is to learn a model by defining which data will be used for learning and the target that we are actually trying to predict.

With Stardog, all this is naturally done via SPARQL. The best way to understand the syntax is through an example. Here, we learn a model to predict the genre of a movie given its director, year, and studio.

prefix spa: <tag:stardog:api:analytics:>

INSERT {
  graph spa:model {
    :myModel  a spa:ClassificationModel ;
              spa:arguments (?director ?year ?studio) ;
              spa:predict ?genre .
  }
}
WHERE {
   ?movie :directedBy ?director ;
          :year ?year ;
          :studio ?studio ;
          :genre ?genre .
}

The WHERE clause selects the data, and a special graph, spa:model, is used to specify the parameters of the training. :myModel is the unique identifier given to this model, which is described by 3 mandatory properties.

First, we need to define the type of learning we are performing:

| Type           | Property                  | Description |
| -------------- | ------------------------- | ----------- |
| classification | `spa:ClassificationModel` | if we are interested in predicting a categorical value that has a limited set of possible values (e.g., genre of a movie) |
| regression     | `spa:RegressionModel`     | if we predict a numerical value that can naturally have an unlimited set of values (e.g., box office of a movie) |
| similarity     | `spa:SimilarityModel`     | if we want to predict the degree of similarity between two objects (e.g., most similar movies) |

The second property, spa:arguments, defines the variables from the WHERE clause that will be used as features when learning the model. Here is where you define the data that you think will help to predict the third property, given by spa:predict.

In this case, our model will be trained to predict the value of ?genre based on the values of ?director , ?year, and ?studio.

Properly defining these 3 properties is the main task when creating any model. Using more advanced parameters is covered below in the Mastering the Machine section.

Making Predictions

Now that we’ve learned a model, we can move on to more exciting stuff and use it to actually predict things.

prefix spa: <tag:stardog:api:analytics:>

SELECT * WHERE {
  graph spa:model {
      :myModel  spa:arguments (?director ?year ?studio) ;
                spa:predict ?predictedGenre .
  }

  :TheGodfather :directedBy ?director ;
                :year ?year ;
                :studio ?studio ;
                :genre ?originalGenre .
}

We select a movie’s properties and use them as arguments to the model Stardog previously learned. The magic comes with the ?predictedGenre variable; during query execution, its value is not going to come from the data itself (like ?originalGenre), but will instead be predicted by the model, based on the values of the arguments.

The result of the query will look like this:

| director            | year | studio             | originalGenre | predictedGenre |
| ------------------- | ---- | ------------------ | ------------- | -------------- |
| :FrancisFordCoppola | 1972 | :ParamountPictures | Drama         | Drama          |

Our model seems to be correctly predicting the genre for The Godfather. Yee!

Query Syntax Restrictions

At this point, only basic graph patterns can be used directly inside the prediction query. If more advanced constructs, like OPTIONAL or FILTER, are necessary, that part of the query needs to be in a sub-query, e.g.:

prefix spa: <tag:stardog:api:analytics:>

SELECT * WHERE {
  graph spa:model {
      :myModel  spa:arguments (?director ?year ?studio) ;
                spa:predict ?predictedGenre .
  }

  {
    SELECT * WHERE {
        ?movie  :directedBy ?director ;
                :year ?year ;
                :genre ?originalGenre .
        OPTIONAL { ?movie :studio ?studio }
        FILTER (?year > 2000)
    }
  }
}

Assessing Model Quality

Metrics

We provide some special aggregate operators that help quantify the quality of a model.

For classification and similarity problems, one of the most important measures is accuracy, that is, the fraction of cases in which we predict the target variable correctly.

prefix spa: <tag:stardog:api:analytics:>

SELECT (spa:accuracy(?originalGenre, ?predictedGenre) as ?accuracy) WHERE {
  graph spa:model {
      :myModel  spa:arguments (?director ?year ?studio) ;
                spa:predict ?predictedGenre .
  }

  ?movie  :directedBy ?director ;
          :year ?year ;
          :studio ?studio ;
          :genre ?originalGenre .
}

+---------------------+
| accuracy            |
| ------------------- |
| 0.92488254018       |
+---------------------+

For regression, we provide three different measures:

Mean absolute error: on average, how far away is the prediction from the real target number: spa:mae(?originalValue, ?predictedValue)
Mean square error: on average, how much is the squared difference between prediction and the target number: spa:mse(?originalValue, ?predictedValue)
Root mean square error: the square root of the mean square error: spa:rmse(?originalValue, ?predictedValue)
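
To make the definitions above concrete, here is a minimal Python sketch of the four measures. It is not Stardog's implementation; the function names simply mirror the spa: operators, and the sample values are made up for illustration.

```python
import math


def accuracy(actual, predicted):
    """Fraction of predictions that exactly match the actual value."""
    pairs = list(zip(actual, predicted))
    return sum(a == p for a, p in pairs) / len(pairs)


def mae(actual, predicted):
    """Mean absolute error: average distance between prediction and target."""
    pairs = list(zip(actual, predicted))
    return sum(abs(a - p) for a, p in pairs) / len(pairs)


def mse(actual, predicted):
    """Mean square error: average squared difference."""
    pairs = list(zip(actual, predicted))
    return sum((a - p) ** 2 for a, p in pairs) / len(pairs)


def rmse(actual, predicted):
    """Root mean square error: square root of the mean square error."""
    return math.sqrt(mse(actual, predicted))


# Made-up sample data, for illustration only.
genres = ["Drama", "Crime", "Drama", "Comedy"]
guesses = ["Drama", "Drama", "Drama", "Comedy"]
print(accuracy(genres, guesses))  # 3 of 4 correct -> 0.75

box_office = [10.0, 20.0, 30.0]
estimates = [12.0, 18.0, 33.0]
print(mae(box_office, estimates), mse(box_office, estimates), rmse(box_office, estimates))
```

Because it squares each difference, the mean square error penalizes a few large mistakes more heavily than the mean absolute error does.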

Automatic Evaluation

Classification and regression models are automatically evaluated with the data used in their training. The score and respective metric can be queried from spa:model.

prefix spa: <tag:stardog:api:analytics:>

SELECT * WHERE {
  graph spa:model {
    :myModel  spa:evaluationMetric ?metric ;
              spa:evaluationScore ?score .
  }
}

+------------------------------------+-------+
|               metric               | score |
+------------------------------------+-------+
| tag:stardog:api:analytics:accuracy | 1.0   |
+------------------------------------+-------+

By default, spa:accuracy is used for classification problems, and spa:mae for regression. This metric can be changed during model learning, by setting the spa:evaluationMetric argument.

prefix spa: <tag:stardog:api:analytics:>

INSERT {
  graph spa:model {
    :myModel  a spa:RegressionModel ;
              spa:evaluationMetric spa:rmse ;
              ...
  }
}
...

Cross Validation

By default, the model is evaluated on the same data it was trained on, which can hide overfitting and make the score look better than it really is. A more reliable measure comes from testing on data that the model has never seen before.

We provide a spa:crossValidation property, which will automatically apply K-Fold cross validation on the training data, with the number of folds given as an argument.

prefix spa: <tag:stardog:api:analytics:>

INSERT {
  graph spa:model {
    :myModel  a spa:RegressionModel ;
              spa:crossValidation 10 ;
              spa:evaluationMetric spa:rmse ;
              ...
  }
}
  ...

prefix spa: <tag:stardog:api:analytics:>

SELECT * WHERE {
   graph spa:model {
       :myModel  spa:evaluation ?validation ;
                 spa:evaluationMetric ?metric ;
                 spa:evaluationScore ?score .
      }
}

+-------------+------------------------------------+-------+
| validation  |               metric               | score |
+-------------+------------------------------------+-------+
| "KFold=10"  | tag:stardog:api:analytics:rmse     | 0.812 |
+-------------+------------------------------------+-------+
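
For readers unfamiliar with K-Fold cross validation, the idea can be sketched in a few lines of Python. This is a generic illustration, not Stardog's code: how Stardog assigns rows to folds is not described here, and the fit_mean baseline and helper names are hypothetical.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_items))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(rows, targets, k, fit, score):
    """Train on k-1 folds, score on the held-out fold, and average the scores."""
    scores = []
    for train, test in k_fold_splits(len(rows), k):
        model = fit([rows[i] for i in train], [targets[i] for i in train])
        predicted = [model(rows[i]) for i in test]
        scores.append(score([targets[i] for i in test], predicted))
    return sum(scores) / len(scores)


def fit_mean(train_rows, train_targets):
    """Hypothetical baseline 'model': always predict the training mean."""
    mean = sum(train_targets) / len(train_targets)
    return lambda row: mean


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


score = cross_validate([None] * 4, [1.0, 2.0, 3.0, 4.0], 2, fit_mean, mean_absolute_error)
print(score)
```

Every row is used for testing exactly once, and never by a model that saw it during training, which is what makes the averaged score a fairer estimate.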

Modeling Data

The way you input data into Stardog during model learning is of utmost importance in order to achieve good quality predictions.

Data Representation

For better results, each individual you are trying to model should be encoded in a single SPARQL result.

For example, suppose you want to add information about actors into the previous model. The query selecting the data would look as follows:

SELECT * WHERE {
   ?movie :actor ?actor ;
          :directedBy ?director ;
          :year ?year ;
          :studio ?studio ;
          :genre ?genre .
}

+---------------+---------------+---------------------+------+--------------------+--------+
| movie         | actor         | director            | year | studio             | genre  | 
| ------------- | ------------- | ------------------- | ---- | ------------------ | ------ |
| :TheGodfather | :MarlonBrando | :FrancisFordCoppola | 1972 | :ParamountPictures | Drama  |
| :TheGodfather | :AlPacino     | :FrancisFordCoppola | 1972 | :ParamountPictures | Drama  |
+---------------+---------------+---------------------+------+--------------------+--------+

Due to the nature of relational query languages like SPARQL, results are returned for every combination of values of the selected variables, so a movie with two actors appears as two separate results.

In order to properly model relational domains like this, we introduced a special aggregate operator, spa:set. Used in conjunction with GROUP BY, it lets us easily model this kind of data as a single result per individual.

prefix spa: <tag:stardog:api:analytics:>

SELECT ?movie (spa:set(?actor) as ?actors) ?director ?year ?studio ?genre WHERE {
   ?movie :actor ?actor ;
          :directedBy ?director ;
          :year ?year ;
          :studio ?studio ;
          :genre ?genre .
}
GROUP BY ?movie ?director ?year ?studio ?genre

+---------------+---------------------------+---------------------+------+--------------------+--------+
| movie         | actors                    | director            | year | studio             | genre  | 
| ------------- | ------------------------- | ------------------- | ---- | ------------------ | ------ |
| :TheGodfather | [:MarlonBrando :AlPacino] | :FrancisFordCoppola | 1972 | :ParamountPictures | Drama  |
+---------------+---------------------------+---------------------+------+--------------------+--------+
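
The effect of grouping with a set aggregate can be mimicked outside SPARQL. The Python sketch below collapses the two result rows shown earlier into one row per movie; the helper name group_as_set and its parameters are hypothetical, and it only illustrates the shape of the data, not how Stardog evaluates spa:set.

```python
def group_as_set(rows, key, multi_valued, as_name):
    """Collapse rows sharing `key` into one row, collecting `multi_valued` in a set."""
    grouped = {}
    for row in rows:
        # First row seen for this key provides the single-valued columns.
        entry = grouped.setdefault(
            row[key], {k: v for k, v in row.items() if k != multi_valued}
        )
        entry.setdefault(as_name, set()).add(row[multi_valued])
    return list(grouped.values())


rows = [
    {"movie": ":TheGodfather", "actor": ":MarlonBrando",
     "director": ":FrancisFordCoppola", "year": 1972,
     "studio": ":ParamountPictures", "genre": "Drama"},
    {"movie": ":TheGodfather", "actor": ":AlPacino",
     "director": ":FrancisFordCoppola", "year": 1972,
     "studio": ":ParamountPictures", "genre": "Drama"},
]

result = group_as_set(rows, key="movie", multi_valued="actor", as_name="actors")
print(len(result), result[0]["actors"])
```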

Data Types

Carefully modeling your data with the correct datatypes can dramatically increase the quality of your model.

Stardog has special treatment for values of the following types:

Numbers, such as xsd:int, xsd:short, xsd:byte, xsd:float, and xsd:double, are treated internally as weights and properly model the difference between values
Strings, xsd:string and rdf:langString, are tokenized and used in a bag-of-words fashion
Sets, created with the spa:set operator, are interpreted as a bag-of-words of categorical features
Booleans, xsd:boolean, are modeled as binary features

Everything else is modeled as categorical features.
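
As a rough mental model of the rules above, the following Python sketch turns a single argument value into weighted features. It is an assumption-laden illustration, not Stardog's internal encoding: the IRI class, the feature naming scheme, and the whitespace tokenizer are all invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IRI:
    """Hypothetical stand-in for a non-literal value such as :ParamountPictures."""
    value: str

    def __str__(self):
        return self.value


def encode(name, value):
    """Return {feature: weight} for one argument value."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return {name: 1.0 if value else 0.0}          # binary feature
    if isinstance(value, (int, float)):
        return {name: float(value)}                   # number used as a weight
    if isinstance(value, str):
        features = {}
        for token in value.lower().split():           # naive whitespace tokenizer
            key = f"{name}={token}"
            features[key] = features.get(key, 0.0) + 1.0
        return features                               # bag of words
    if isinstance(value, (set, frozenset)):
        return {f"{name}={item}": 1.0 for item in value}  # bag of categories
    return {f"{name}={value}": 1.0}                   # everything else: categorical


print(encode("year", 1972))
print(encode("title", "The Godfather"))
print(encode("studio", IRI(":ParamountPictures")))
```

The sketch shows why datatypes matter: the year 1972 stored as a number keeps its distance to 1974, while the same year stored as a string or IRI becomes an unrelated category.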

Setting the correct data type for the target variable, given through spa:predict, is extremely important:

with regression, make sure values are numeric
with classification, individuals of the same class should have consistent data types and values
with similarity, use values that uniquely identify an object, e.g., an IRI

For everything else, using the datatype that is closest to its original meaning is a good rule of thumb.

Mastering the Machine

Let’s look at some other issues around the daily care and feeding of predictive analytics and models in Stardog.

Overwriting Models

By default, you cannot create a new model with the same identifier as an existing one. If you try to do so, you’ll be greeted with a Model already exists error.

In order to reuse an existing identifier, users can set the spa:overwrite property to True. This will delete the previous model and save the new one in its place.

prefix spa: <tag:stardog:api:analytics:>

INSERT {
  graph spa:model {
    :myModel  a spa:RegressionModel ;
              spa:overwrite True ;
              ...
  }
}
  ...

Deleting Models

Finding good models is an iterative process, and sometimes you’ll want to delete your old—not as awesome and now unnecessary—models. This can be achieved with DELETE DATA and the spa:deleteModel property applied to the model identifier.

prefix spa: <tag:stardog:api:analytics:>

DELETE DATA {
  graph spa:model {
      [] spa:deleteModel :myModel .
  }
}

Classification and Similarity with Confidence Levels

Sometimes, besides predicting the most probable value for a property, you will be interested to know the confidence of that prediction. By providing the spa:confidence property, you can get confidence levels for all the possible predictions.

prefix spa: <tag:stardog:api:analytics:>

SELECT * WHERE {
  graph spa:model {
      :myModel  spa:arguments (?director ?year ?studio) ;
                spa:confidence ?confidence ;
                spa:predict ?predictedGenre .
  }

  :TheGodfather :directedBy ?director ;
          :year ?year ;
          :studio ?studio .
}
ORDER BY DESC(?confidence)
LIMIT 3

| director            | year | studio             | predictedGenre | confidence     |
| ------------------- | ---- | ------------------ | -------------- | -------------- |
| :FrancisFordCoppola | 1972 | :ParamountPictures | Drama          | 0.649688932    |
| :FrancisFordCoppola | 1972 | :ParamountPictures | Crime          | 0.340013045    |
| :FrancisFordCoppola | 1972 | :ParamountPictures | Sci-fi         | 0.010298023    |

These values can be interpreted as the probability of the given prediction being the correct one and are useful for tasks like ranking and multi-label classification.
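
Once the confidence values are in hand, ranking and multi-label classification are simple post-processing steps. The Python sketch below uses the three rows from the result above; the helper names and the 0.3 threshold are assumptions chosen for illustration.

```python
def top_k(confidences, k):
    """Return the k (label, confidence) pairs with the highest confidence."""
    ranked = sorted(confidences.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def labels_above(confidences, threshold):
    """Multi-label classification: keep every label at or above the threshold."""
    return [label for label, conf in top_k(confidences, len(confidences))
            if conf >= threshold]


# Confidence levels taken from the query result for :TheGodfather.
confidences = {"Sci-fi": 0.010298023, "Drama": 0.649688932, "Crime": 0.340013045}

print(top_k(confidences, 2))
print(labels_above(confidences, 0.3))  # threshold is an arbitrary example value
```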

Tweaking Parameters

Both Vowpal Wabbit and similarity search can be configured with the spa:parameters property.

prefix spa: <tag:stardog:api:analytics:>

INSERT {
  graph spa:model {
    :myModel  a spa:ClassificationModel ;
              spa:library spa:VowpalWabbit ;
              spa:parameters [
                spa:learning_rate 0.1 ;
                spa:sgd True ;
                spa:hash 'all'
              ] ;
              spa:arguments (?director ?year ?studio) ;
              spa:predict ?genre .
  }
}
...

Parameter names for both libraries are valid properties in the spa prefix, and their values can be set during model creation.

Vowpal Wabbit

By default, models are learned with [ spa:loss_function "logistic"; spa:probabilities true; spa:oaa true ] in classification mode, and [ spa:loss_function "squared" ] in regression. Those parameters are overwritten when using the spa:parameters property with regression, and appended in classification.

Check the official documentation for a full list of parameters. Some tips that might help with your choices:

Use cross-validation when tweaking parameters. Otherwise, make sure your testing set is not biased and represents a true sample of the original data.
The most important parameter to tweak is the learning rate spa:l. Values between 0.01 and 1 usually give the best results.
To prevent overfitting, set spa:l1 or spa:l2 parameters, preferably with a very low value (e.g., 0.000001).
If the number of distinct features is large, make sure to increase the number of bits spa:b to a larger value (e.g., 22).
Each argument given with spa:arguments has its own namespace, identified by its numeric position in the list (starting with 0). For example, to create quadratic features between ?director and ?studio, set spa:q "02".
If caching is enabled (e.g., with spa:passes), always use the [ spa:k true; spa:cache_file "fname" ] parameters, where fname is a unique filename for that model.
In regression, the target variable given with spa:predict is internally normalized into the [0, 1] range, and denormalized back to its original range during query execution. For certain problems where numeric arguments have large values, performance might be improved by performing a similar normalization as a pre-processing step.
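
Two of the tips above lend themselves to a short Python sketch: min-max normalization of a numeric argument as a pre-processing step, and deriving the spa:q value from argument positions. Both helpers are hypothetical; the exact normalization Stardog applies internally is not specified here, and quadratic_pair assumes fewer than ten arguments so that each position is a single digit.

```python
def fit_min_max(values):
    """Return (normalize, denormalize) functions mapping values to and from [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo

    def normalize(x):
        # A constant column has no spread; map it to 0.0 to avoid dividing by zero.
        return 0.0 if span == 0 else (x - lo) / span

    def denormalize(y):
        return lo + y * span

    return normalize, denormalize


def quadratic_pair(arguments, first, second):
    """Build a spa:q value from the 0-based positions of two arguments."""
    return f"{arguments.index(first)}{arguments.index(second)}"


normalize, denormalize = fit_min_max([1950, 1972, 2000])
print(normalize(1972))               # 22 / 50
print(denormalize(normalize(1972)))  # back to the original scale

arguments = ["?director", "?year", "?studio"]
print(quadratic_pair(arguments, "?director", "?studio"))  # "02"
```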

Similarity Search

The underlying algorithm is based on cluster pruning, an approximate search algorithm which groups items based on their similarity in order to speed up query performance.

The minimum number of items per cluster can be configured with the spa:minClusterSize property, which is set to 100 by default.

prefix spa: <tag:stardog:api:analytics:>

INSERT {
  graph spa:model {
    :myModel  a spa:SimilarityModel ;
              spa:parameters [
                spa:minClusterSize 100 ;
              ] ;
              spa:arguments (?director ?year ?studio) ;
              spa:predict ?movie .
  }
}
...

This number should be increased with datasets containing many near-duplicate items.

During prediction, there are two parameters available:

spa:limit, which restricts the number of top N items to return; by default, it returns only the top item, or all items if using spa:confidence.
spa:clusters, which sets the number of similarity clusters used during the search, with a default value of 1. Larger numbers will increase recall, at the expense of slower query time.

For example, the following query will return the top 3 most similar items and their confidence scores, restricting the search to 10 clusters.

prefix spa: <tag:stardog:api:analytics:>

SELECT * WHERE {
  graph spa:model {
    :myModel  spa:parameters [
                spa:limit 3 ;
                spa:clusters 10
              ] ;
              spa:confidence ?confidence ;
              spa:arguments (?director ?year ?studio) ;
              spa:predict ?similar .
  }
}
...
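
To give a feel for how cluster pruning trades recall for speed, here is a simplified Python sketch. It is not Stardog's implementation: leaders are picked at random, min_cluster_size is only used to decide how many leaders to pick, and Euclidean distance stands in for whatever similarity measure is actually used. The limit and n_clusters arguments play the roles of spa:limit and spa:clusters.

```python
import math
import random


def build_index(items, min_cluster_size, seed=0):
    """Group items (id -> vector) around randomly chosen leader items."""
    rng = random.Random(seed)
    n_leaders = max(1, len(items) // min_cluster_size)
    leaders = rng.sample(sorted(items), n_leaders)
    clusters = {leader: [] for leader in leaders}
    for item_id, vector in items.items():
        nearest = min(leaders, key=lambda l: math.dist(vector, items[l]))
        clusters[nearest].append(item_id)
    return clusters


def search(items, clusters, query, limit=1, n_clusters=1):
    """Look only inside the n_clusters whose leaders are closest to the query."""
    nearest_leaders = sorted(
        clusters, key=lambda l: math.dist(query, items[l])
    )[:n_clusters]
    candidates = [i for leader in nearest_leaders for i in clusters[leader]]
    return sorted(candidates, key=lambda i: math.dist(query, items[i]))[:limit]


# Made-up two-dimensional feature vectors, for illustration only.
items = {
    "a1": (0.0, 0.0), "a2": (0.0, 1.0), "a3": (1.0, 0.0),
    "b1": (10.0, 10.0), "b2": (10.0, 11.0), "b3": (11.0, 10.0),
}
clusters = build_index(items, min_cluster_size=3)
print(search(items, clusters, (0.0, 0.2), limit=2, n_clusters=len(clusters)))
```

Searching a single cluster is fastest but can miss good matches that were assigned to a neighboring cluster; searching all clusters is exhaustive and therefore exact, which is the trade-off the number of clusters controls.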

Hyperparameter Optimization

Finding the best parameters for a model is a time-consuming, laborious process. Stardog helps to ease the pain by performing an exhaustive search through a manually specified subset of parameter values.

prefix spa: <tag:stardog:api:analytics:>

INSERT {
  graph spa:model {
    :myModel  a spa:ClassificationModel ;
              spa:library spa:VowpalWabbit ;
              spa:parameters [
                spa:learning_rate (0.1 1 10) ;
                spa:hash ('all' 'strings')
              ] ;
              spa:arguments (?director ?year ?studio) ;
              spa:predict ?genre .
  }
}
...

All possible sets of parameter configurations that can be built from the given values (spa:learning_rate 0.1 ; spa:hash 'all', spa:learning_rate 1 ; spa:hash 'all', and so on) will be evaluated. The best configuration will be chosen, and its model saved in the database.
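
The exhaustive search described above amounts to a grid search over the Cartesian product of the listed values. A minimal Python sketch follows; the evaluate function is a made-up stand-in for training and scoring a model, where a higher score is assumed to be better.

```python
from itertools import product


def expand_grid(grid):
    """List every configuration that can be built from the given values."""
    names = list(grid)
    return [dict(zip(names, combo))
            for combo in product(*(grid[name] for name in names))]


def best_configuration(grid, evaluate):
    """Score every configuration and return the highest-scoring one."""
    return max(expand_grid(grid), key=evaluate)


grid = {"learning_rate": [0.1, 1, 10], "hash": ["all", "strings"]}


def evaluate(config):
    """Toy scoring function, invented for this example only."""
    bonus = 0.5 if config["hash"] == "all" else 0.0
    return bonus - abs(config["learning_rate"] - 1)


print(len(expand_grid(grid)))  # 3 learning rates x 2 hash modes
print(best_configuration(grid, evaluate))
```

The number of configurations is the product of the list lengths, so each additional parameter list multiplies the training time.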

Afterwards, parameters are available for querying, just like any other model metadata.

prefix spa: <tag:stardog:api:analytics:>

SELECT * WHERE {
    graph spa:model {
        :myModel  spa:parameters [ ?parameter ?value ]
    }
}

+-------------------+-------+
|     parameter     | value |
+-------------------+-------+
| spa:hash          | "all" |
| spa:learning_rate | 1     |
+-------------------+-------+

Native Library Errors

Stardog ships with a pre-compiled version of Vowpal Wabbit (VW) that works out of the box with most macOS/Linux 64-bit distributions.

If you have a 32-bit operating system, or an older version of Linux, you will be greeted with an Unable to load analytics native library error when trying to create your first model.

Exception in thread "main" java.lang.RuntimeException: Unable to load analytics native library. Please refer to http://www.stardog.com/docs/#_native_library_errors
    at vowpalWabbit.learner.VWLearners.loadNativeLibrary(VWLearners.java:94)
    at vowpalWabbit.learner.VWLearners.initializeVWJni(VWLearners.java:76)
    at vowpalWabbit.learner.VWLearners.create(VWLearners.java:44)
    ...
Caused by: java.lang.RuntimeException: Unable to load vw_jni library for Linux (i386)

In this case, you will need to install VW manually. Fear not! Instructions are easy to follow.

git clone https://github.com/cpdomina/vorpal.git
cd vorpal/build-jni/
./build.sh
sudo cp transient/lib/vw_wrapper/vw_jni.lib /usr/lib/libvw_jni.so

You might need to install some dependencies, namely zlib-devel, automake, libtool, and autoconf.

After this process is finished, restart the Stardog server and everything should work as expected.