# Graph Algorithms

This page discusses the details of the graph analytics algorithms. We go over the algorithms supported and the input parameters needed to run these algorithms.

Page Contents

## Overview

Graph algorithms let users analyze the contents of a graph and the relationships between the nodes in a way that is not possible with SPARQL. Graph algorithms can be used for ranking nodes in a graph or detecting communities based on the connectivity of the nodes. Most of the graph algorithms work iteratively either for a fixed number of steps or until a convergence condition is met.

Graph analytics algorithms work by leveraging the Stardog Spark connector. The computation starts by submitting a Spark job that specifies the algorithm to run along with various input parameters. Spark job retrieves graph data from Stardog and runs the specified algorithm. The graph may be the complete contents of a Stardog database or a subset as defined in the input parameters. The algorithm outputs a number for each node in the input graph. The meaning of this number depends on the algorithm run and is explained below. The output numbers are then saved back to the Stardog database by connecting the node to the output value with a property.

Several input parameters apply to all algorithms whereas some input parameters are specific to some algorithms. In the following sections we describe the input parameters and the algorithms. See the Input Parameters section for how to set these parameters and the Usage for passing these parameters for execution.

## General parameters

There are several parameters that are general to all analytics algorithms. These parameters are explained in the following categories based on the purpose they serve.

### Connection parameters

The following parameters specify how to connect to Stardog, which database to use and the graph data that will be retrieved from that database.

The parameters in bold are required. If a value for a parameter is not provided the default value will be used.

Parameter Default Value Description
stardog.server N/A Stardog server address, e.g. http://example.com:5820
stardog.database N/A Stardog database name
stardog.query construct { ?s ?p ?o }
from <tag:stardog:api:context:local>
where { ?s ?p ?o }
Query used to pull data from Stardog. The default query pulls all the data stored in the database (including the default graph and all named graphs). User can override this query with any valid CONSTRUCT query. A stored query name can be provided here instead of a query string.
stardog.query.timeout N/A Timeout that will be used for running the query. If no value is specified the timeout defined in the database settings will be used. The default timeout might need to be increased if the query pulls large amounts of data. If the timeout is set low then pulling the data would fail and Spark would retry as many times as specified in Spark settings
stardog.reasoning false Boolean flag to enable reasoning for the query. If set to `true` the default schema will be used for reasoning.
stardog.reasoning.schema N/A Specify a reasoning schema name for the query. If this parameter is set `stardog.reasoning` parameter will not have any effect and reasoning will be enabled.

### Output parameters

Parameter Default Value Description
output.property N/A The Property IRI that will be used to link the node in the Stardog database to the output computed by the graph analytics algorithm. If no value is provided the property name will be auto-generated as `tag:stardog:api:<AlgorithmName>` based on the algorithm selected.
output.graph DEFAULT The named graph where the results will be saved to. The default graph will be used if no value is specified.
output.dryrun false The boolean flag to control if output will be written to Stardog. If set to `true` graph analytics algorithm will run but no information will be saved back to Stardog. Information such the number of results computed will be printed in the logs.

### Spark parameters

Parameter Default Value Description
spark.dataset.partitions 0 Number of partitions that will be used to parallelize data ingestion from Stardog. Each partition corresponds to a query with a different offset, and these queries are issued in parallel. This setting does not affect the `spark.sql.shuffle.partitions` or `spark.default.parallelism` settings. Has no effect unless `spark.dataset.size` is specified.
spark.dataset.size 0 Total dataset size (number of triples). Needs to be calculated in advance but can be an approximate estimation. Works in combination with `spark.dataset.partitions` to split the data into roughly equal partitions.
spark.dataset.repartition 0 Number of partitions that will be used to parallelize the data before passing it to graph algorithm. Unlike `spark.dataset.partitions`, does not cause multiple queries but allows for better distribution of data across Spark worker nodes. If set to 0 (default), number of nodes in the cluster will be used.
spark.checkpoint.dir stardog-graph-analytics The checkpoint dir that will be used by the algorithms. Currently, only the Connected Components algorithm uses checkpointing. The files created in this directory will not be automatically cleaned by the Stardog connector. You can use the Spark option `spark.cleaner.referenceTracking.cleanCheckpoints` for automatic cleanup.

## Graph Algorithms

### Page Rank

PageRank is an algorithm to measure the importance of nodes in a graph based on the quantity and quality of incoming links. The higher number of incoming links increases the importance of a node, i.e. its rank, but links from highly-ranked nodes will also have bigger effect.

The PageRank algorithm is defined recursively where the weights associated with nodes are updated at every iteration. The algorithm can be run either for a fixed number of iterations or until the values of nodes converge where no rank is being updated more than a threshold value.

Set `algorithm.name=PageRank` to run this algorithm. The output will be a double value representing the rank computed for each node.

The following input parameters can be specified for PageRank:

Parameter Default Value Description
algorithm.iterations 3 Maximum number of iterations the algorithm will run.
pagerank.tolerance N/A The tolerance value allowed at convergence. If this option is set, PageRank algorithm will run until all the weights are smaller than this value. so setting this value smaller will yield more accurate results but will take longer to compute.
pagerank.reset.probability 0.15 Reset probability that will be used.

### Label Propagation

Label Propagation is a community detection algorithm. Each node in the graph is initially assigned to its own community. At every iteration, nodes send their community affiliation to all neighbors and update their state based on the community affiliation of incoming messages.

Label Propagation is not computationally expensive but convergence is not guaranteed. It is also possible to get trivial results where each node gets assigned to a single community.

Set `algorithm.name=LabelPropagation` to run this algorithm. The output will be a long value identifying the community affiliation for each node.

Parameter Default Value Description
algorithm.iterations 3 Maximum number of iterations the algorithm will run.

### Strongly Connected Components

Strongly Connected Components is a kind of community detection algorithm. A strongly connected component is a set of nodes where each node is reachable from every other node by following the edges respecting the edge direction.

In the following graph, there are three strongly connected components. Nodes A, B and C form a strongly-connected component. Nodes D and E form separate singular components. Set `algorithm.name=StronglyConnectedComponents` to run this algorithm. The output will be a long value identifying the component each node belongs to. The nodes with the same output value are in the same component.

Parameter Default Value Description
algorithm.iterations 3 Maximum number of iterations the algorithm will run.

### Connected Components

Connected Components is a kind of community detection algorithm. A connected component is a set of nodes where each node is reachable from every other node by following the edges ignoring the edge direction.

In the following graph, there are two connected components. Node A, B and C form one connected component. Nodes D and E form another connected component. Set `algorithm.name=ConnectedComponents` to run this algorithm. The output will be a long value representing the component each node belongs to. The nodes with the same output value are in the same component.

Parameter Default Value Description
algorithm.iterations 3 Maximum number of iterations the algorithm will run.

### Triangle Count

Triangle Count is the number of triangles a node is a member of in the graph. A triangle is defined as the three nodes connected to each other without taking the direction of the edges into account.

Set `algorithm.name=TriangleCount` to run this algorithm. The output will be a long value representing the number of triangles.

This algorithm has no input parameters.