Lightweight force-directed graph layout
The code described in this post was inspired by some practical problems in drawing relatively large highly-connected networks / graphs. Specifically, the data I have been dealing with are correlation matrices derived from gene expression measurements. A whole-genome expression dataset, as generated by microarray hybridisation or next-generation sequencing, typically covers in the order of 10,000–30,000 genes—each of which will become a vertex in the graph. While there are many software solutions claiming to be capable of laying out graphs with this number of vertices, I have repeatedly encountered problems with such datasets. This may partly be due to the highly-connected nature of correlation networks. The network I will be using as an example contains 17,868 vertices and 1,419,108 edges (after thresholding to remove weak correlations).
Some years ago, I wrote a solution in Java which performed reasonably well but, as I do almost all my work in R these days, it would be hugely convenient to have a tool to lay out such big graphs from within my R workflow. I use the wonderful igraph package for analysis and layout of smaller graphs but this does not (at least in the R version) scale well for large datasets in my experience. In the package described here, I have written a simplified force-directed layout algorithm that makes best use of R-style vectorisation and introduces a few shortcuts to minimise computation time.
The layout algorithm
The aim of the algorithm is to draw each vertex (representing a gene in my example) connected by an edge (here representing the strength of the correlation). The problem is to get the length of each edge as close as possible to the ‘optimal’ length as defined by the correlation. In fact, the optimal edge length is defined by the distance between vertices, where a distance of 0 means the vertices should be on top of each other and a distance of 1 means they should be as far away from each other as possible. Some vertex pairs do not have a defined distance, so their edge lengths are not explicity set—they tend to be pushed around depending on the other vertices and edges they are connected to. In my example, the connections between genes have been calculated as the Pearson correlation between the expression profiles of each pair of genes. Pearson correlation coefficients range from 1 (perfectly correlated; A and B have the same pattern of data, as seen when calcluating the correlation of a gene to itself) to -1 (perfectly anti-correlated; when A goes up, B goes down). A correlation of 0 means the expression of A cannot be used to predict the expression of B at all; they are completely unrelated.
Edge distances are based on the Pearson correlation—a measure of how similar the expression patterns of the vertices (genes) are. From left to right; highly correlated (correlation = 0.9, distance = 0.1), not correlated (correlation = 0, distance = 1) and inversely correlated (correlation = -0.9, distance = 1).
To calculate the distance, therefore, we can simply flip the correlation score so that a correlation of 1 (same expression) becomes a distance of 0 (very close) and a correlation of 0 (no similarity) becomes a distance of 1 (furthest apart). What happens to the negative correlations? There are different solutions, but I discard them (set them to have a distance of 1). It can be argued that a very strong negative correlation has a lot of predictive power, but the idea behind the graph layout here is to bring together the vertices that are most similar. This problem has been discussed before and my own experience has suggested that trying to cluster vertices based on negative edges is problematic—although other ideas exist.
The way I typically do this is therefore to create the correlation matrix (‘network’), and subtract all of the positive values from 1 (the R code I use performs a logical test to determine the positive values and multiplies the TRUE/FALSE matrix—which becomes 1/0 when automatically cast to numeric—with the raw correlations so that all negative values become 0).
distance = 1 - (network * (network >= 0))
There is a function, as.distance, described below that takes care of this step.
Once we have the distance matrix, the layout can begin. This is achieved by multiple rounds of checking the current edge length against the optimal distance and adjusting the edge length a little bit. In fact, all of the edges are checked at once as a vector (this avoids explicit loops and is by far the fastest way to implement this in R) and the amount that each edge needs to change in length is determined. Then the amount that the vertices at each end of the edge need to move is calculated (the ‘source’ and ‘target’ vertex each move together (if the edge needs to be shorter than it is) or apart by half the required amount). Obviously each vertex will be associated typically with many edges so that there will be many of these required moves acting on a particular vertex. The solution is simply to calculate the centroid of all of the required changes and use this to redefine the coordinates of the vertex for the current iteration. This whole process is repeated many times such that the vertices jiggle into a state where all of the ‘forces’ acting on them optimally cancel out. This is then the final layout. The algorithm I have written starts all of the vertices off at the same place (defined as the origin of the graph). Then the vertices push apart because of the ‘force’ acting to optimise the edge lengths. The angle between two vertices needs to be calculated in order to know in which direction the lengthening or shortening of the edges should occur. This is obviously not defined when two vertices are on top of each other (as in the initial state) so in this case a random angle is selected. Setting a ‘seed’ for the random number generator will make this reproducible (see details in below and in the example code). There is also another ‘force’ acting on the vertices which pushes them apart independently of the optimal distances. This repulsive force decays exponentially and only acts on vertices that are too close to each other. This has the effect of spreading unconnected vertices apart and will also drive single unconnected (‘orphan’) vertices and small subnetworks to the edge of the graph. Together, these simple rules cause the vertices to arrange themselves into a stable state where the most highly correlated genes are clustered closely together while unrelated genes get pushed apart.
The DataNet package
I have put the above algorithm, together with some helper functions, into an R package, DataNet. The package also contains a few datasets which can be used for creating test graphs for comparison or playing with parameters.
To install the package, use install.packages with the latest version of the package from my software page (at the time of writing, I just released version 1.0.1).
install.packages('http://research.rupertoverall.net/resources/datanet/DataNet_1.0.1.tar.gz', repos = NULL)
And then load the package into the workspace;
There is an example dataset (expression data for 1000 genes) included in the package. We now load this and create a distance matrix from it (just the Pearson correlation matrix converted into distances, so that vertices/genes with high positive correlations are drawn closer to each other).
data(hipExpr) network = cor(hipExpr, method="pearson", use="pairwise.complete.obs") distance.matrix = as.distance(network) # Convert to edge list edge.list = as.edge.list(distance.matrix, 0.4)
We can now actually run the layout. This code will perform 100 iterations.
# Set the seed for the random number generator set.seed(1) # Calculate coordinates coordinates = fast.force.layout(edge.list, iterations = 100)
Then the resulting coordinates can be used to plot the graph. The edges are plotted in order of their distance, so that edges with the shortest distances (strongest correlations) are plotted on top. This helps highlight the most informative structure in the graph.
# Plot graph draw.stick.graph(edge.list, coordinates)
Layout of the network at different thresholds. From left to right, distance thresholds are 0.6, 0.5 and 0.4 (corresponding to correlations above 0.4, 0.5 and 0.6 respectively).
The default colour scheme ranges from red (distance of 0 / correlation of 1) through yellow to white (distance of 1 / correlation of 0 or negative correlation).
Just for fun, I have also created a movie of the graph converging on its layout. I have run 1000 iterations for this, but it can be seen that the layout reaches a stable state after around 500 iterations.