## Saturday, April 23, 2016

### Graphs as Matrices - part 2

As we saw in the last post, treating graphs as matrices gives us a lot of tools to analyse them without traversing them.

One handy tool is cycle detection. Taking the adjacency matrix from the last post, we can count the paths that go from i to some j and back to i by taking all the outgoing connections from i and ANDing them with all the incoming connections to i. Since an element of the adjacency matrix is 0 if and only if there is no connection, multiplication acts like an AND operator.

All the outgoing connections for node i are represented by the i-th row of the matrix (the elements Ai,j) and all the incoming connections to node i are represented by the i-th column (the elements Aj,i). Multiplying each outgoing value by the corresponding incoming one and adding them all up (to give us the total number of paths) is:

    number of 2-step cycles through i = Σj=1..n Ai,j Aj,i

which, if you look at the Wikipedia page for matrix multiplication, is just the i-th diagonal element of A2.

If we wanted to find the number of cycles taking 3 steps rather than 2, then the equation becomes:

    number of 3-step cycles through i = Σj=1..n Σk=1..n Ai,j Aj,k Ak,i

which is just the i-th diagonal element of A3.
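A quick sanity check of that claim (my own sketch, using a three-node cycle rather than the matrix from the last post):

```python
import numpy as np

# A tiny directed triangle: 0 -> 1 -> 2 -> 0.
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]])

A3 = np.linalg.matrix_power(A, 3)

# Every node sits on exactly one 3-step cycle back to itself,
# so each diagonal element of A^3 is 1 (in fact A^3 is the identity).
print(np.diag(A3))  # [1 1 1]
```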

More generally, the off-diagonal values ((Ax)i,j where i ≠ j) are also of interest. They give the number of paths from i to j in exactly x steps.

Now, taking our matrix representing an acyclic graph, let's keep multiplying it by itself (in Python):

>>> B = re_arranged_rows_and_cols
>>> np.dot(B, B)
matrix([[0, 0, 1, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
>>> np.dot(B, np.dot(B, B))
matrix([[0, 0, 0, 0, 0, 0, 0, 0, 3, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
>>> np.dot(B, np.dot(B, np.dot(B, B)))
matrix([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
>>> np.dot(B, np.dot(B, np.dot(B, np.dot(B, B))))
matrix([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
>>> np.dot(B, np.dot(B, np.dot(B, np.dot(B, np.dot(B, B)))))
matrix([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
>>>

Ah. Because it's acyclic, there is a limit to how many steps we can take (5, apparently: B5 above still has non-zero entries but B6 is all zeros). It's not too surprising if you look at the graph and try to find the longest path possible by eye:

import matplotlib.pyplot as plt

G = nx.from_numpy_matrix(A, create_using=nx.MultiDiGraph())
nx.draw(G, with_labels=True)
plt.show()

(The thick end of the connection indicates its direction like an arrow).
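The chains of np.dot calls above can be wrapped in a loop (or replaced with np.linalg.matrix_power). A sketch of that idea as a hypothetical helper; note it assumes the graph is acyclic, otherwise the loop never terminates:

```python
import numpy as np

def longest_path_length(A):
    """For an acyclic adjacency matrix, return the number of edges in the
    longest path: the largest k for which A^k still has a non-zero entry.
    Warning: loops forever if the graph contains a cycle."""
    A = np.asarray(A)
    k, P = 0, np.eye(A.shape[0], dtype=int)
    while True:
        P = P @ A          # P is now A^(k+1)
        if not P.any():    # all paths exhausted
            return k
        k += 1

# A chain 0 -> 1 -> 2 -> 3: the longest path has 3 edges.
chain = np.array([[0, 1, 0, 0],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1],
                  [0, 0, 0, 0]])
print(longest_path_length(chain))  # 3
```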

Now, let's introduce an edge from 2 back to 0, creating a cycle:

>>> B[2,0] = 1
>>> B
matrix([[0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

so the graph now contains the cycle 0 → 1 → 2 → 0.
We also note that the eigenvalues are no longer all 0 (as predicted by my previous post).

>>> eig(B)[0]
array([-0.5+0.8660254j, -0.5-0.8660254j,  1.0+0.j       ,  0.0+0.j       ,
0.0+0.j       ,  0.0+0.j       ,  0.0+0.j       ,  0.0+0.j       ,
0.0+0.j       ,  0.0+0.j       ,  0.0+0.j       ,  0.0+0.j       ,
0.0+0.j       ])

The total number of cycles of length k can be calculated from the trace of the matrix after k repeated multiplications:

>>> B.trace()
matrix([[0]])
>>> np.dot(B, B).trace()
matrix([[0]])
>>> np.dot(B, np.dot(B, B)).trace()
matrix([[3]])

There is one caveat: "Note that this expression counts separately loops consisting of the same vertices in the same order but with different starting points. Thus the loop 1 -> 2 -> 3 -> 1 is considered different from the loop 2 -> 3 -> 1 -> 2." (Networks - An Introduction, Newman).
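In line with Newman's caveat, trace(A3) counts each triangle once per starting vertex. In a digraph with no self-loops every closed 3-walk is a simple triangle, so dividing by 3 recovers the number of distinct triangles (this shortcut is my addition, and it only works this cleanly for length 3):

```python
import numpy as np

# Directed triangle 0 -> 1 -> 2 -> 0: one simple cycle, three starting points.
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]])

closed_walks = int(np.trace(np.linalg.matrix_power(A, 3)))
print(closed_walks)       # 3: the same triangle, counted once per starting vertex
print(closed_walks // 3)  # 1 distinct directed triangle
```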

Looking at the matrix tells you who is in a cycle (see the diagonal elements). For instance, this is what the matrix looks like after 3 hops:

>>> np.dot(B, np.dot(B, B))
matrix([[1, 0, 0, 0, 0, 0, 0, 0, 3, 1, 0, 0, 0],
[0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1],
[0, 0, 1, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

Note that the entry at row 0, column 8 has the value 3 and, if you look at the picture above, you can see that there are indeed 3 different ways to get from 0 to 8 in 3 hops.

## Friday, April 22, 2016

### Graphs as Matrices - part 1

Graphs are typically represented as adjacency lists. But if they're represented as matrices, you get all the juicy goodness of linear algebra.

What does this mean? Well, for example, let's represent this directed graph as a matrix.

 Digraph from Sedgewick's "Algorithms".
Note that it has no cycles, so we can represent it like this:

 The same digraph.
We can represent this as an n x n matrix, A, where Ai,j is 1 when i is linked to j and 0 otherwise. Since there are no self-loops, Ai,i = 0.

In Python, it would look like this:

import numpy as np
from numpy.linalg import eig
from scipy.linalg import lu
import networkx as nx
import sympy

A = np.matrix(  # 0  1  2  3  4  5  6  7  8  9 10 11 12
'0  1  0  0  0  1  1  0  0  0  0  0  0;'  # 0
'0  0  0  0  0  0  0  0  0  0  0  0  0;'  # 1
'1  0  0  1  0  0  0  0  0  0  0  0  0;'  # 2
'0  0  0  0  0  1  0  0  0  0  0  0  0;'  # 3
'0  0  0  0  0  0  0  0  0  0  0  0  0;'  # 4
'0  0  0  0  1  0  0  0  0  0  0  0  0;'  # 5
'0  0  0  0  1  0  0  0  0  1  0  0  0;'  # 6
'0  0  0  0  0  0  1  0  0  0  0  0  0;'  # 7
'0  0  0  0  0  0  0  1  0  0  0  0  0;'  # 8
'0  0  0  0  0  0  0  0  0  0  1  1  1;'  # 9
'0  0  0  0  0  0  0  0  0  0  0  0  0;'  # 10
'0  0  0  0  0  0  0  0  0  0  0  0  1;'  # 11
'0  0  0  0  0  0  0  0  0  0  0  0  0'   # 12
)

It's worth noting that the eigenvalues of this matrix are all 0:

print "eigenvalues = ", eig(A)[0] #  [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]

This is always true of adjacency matrices of acyclic graphs. I'll explain why.

Firstly, if we re-label the nodes, we can change the order of the rows and columns such that it becomes an upper triangular matrix. Given an order that I calculated elsewhere:

newOrder = [2, 0, 1, 3, 5, 8, 7, 6, 4, 9, 10, 11, 12]
re_arranged_cols = A[newOrder, :]
re_arranged_rows_and_cols = re_arranged_cols[:, newOrder]
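The order "calculated elsewhere" is a topological sort of the DAG. A sketch of how such an order can be computed, using a hypothetical 4-node diamond and networkx (newer networkx releases name the constructor from_numpy_array):

```python
import numpy as np
import networkx as nx

# A hypothetical 4-node diamond DAG: 0 -> {1, 2} -> 3.
A = np.array([[0, 1, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 1],
              [0, 0, 0, 0]])

# from_numpy_matrix was renamed from_numpy_array in newer networkx releases.
G = nx.from_numpy_array(A, create_using=nx.DiGraph)
order = list(nx.topological_sort(G))

# Re-arranging rows and columns into topological order makes every edge
# point "forward", i.e. the matrix becomes upper triangular.
B = A[order, :][:, order]
assert np.array_equal(B, np.triu(B))
```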

we get a matrix that looks like this:

[[0 1 0 1 0 0 0 0 0 0 0 0 0]
[0 0 1 0 1 0 0 1 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 0 0 1 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 1 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 1 1 1]
[0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1]
[0 0 0 0 0 0 0 0 0 0 0 0 0]]

which also has all zero eigenvalues:

print "eigenvalues for this new matrix: ", eig(re_arranged_rows_and_cols)[0]  # all zeroes, as expected

Notice that with this new choice of basis, the matrix has all the connections (that is, the 1 values) above the diagonal and the diagonal values themselves are all zeros (since even with this re-ordering there are still no self-loops).

Triangular Matrices

Now, an interesting thing about upper triangular matrices is that their diagonal elements are their eigenvalues. The reasoning goes like this:

1. The elements of an upper triangular matrix, ai,j, are necessarily 0 if i > j (where i and j are the row and column indexes); that is, everything below the diagonal is zero:

ai,j = 0 ∀ {i, j | i > j }

2. A formula for calculating determinants is this one from Leibniz:

    det(A) = Σσ∈Sn sgn(σ) Πi=1..n aσ(i),i

where

• n is the order of the square matrix (2 if it's 2x2; 3 if it's 3x3 ...)
• σ is a permutation of n integers in the permutation group Sn.
• sgn(σ) is 1 or -1 depending on the parity of σ. Ignore this; it will disappear.
• and σ(i) is the i-th integer in σ

Take a look at the rightmost part of (2). It multiplies n elements of the matrix, given by aσ(i),i. But remember from (1) that the product is 0 if any σ(i) is greater than i. If you think about it, there is only one permutation for which this never happens, namely the identity [1, 2, 3, ... n]. So, for a triangular matrix,

    det(A) = Πi=1..n ai,i

that is, the determinant is the product of the diagonals.
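A quick numerical check of that result (my own example, not from the recipe above):

```python
import numpy as np

# A hypothetical upper triangular matrix.
T = np.array([[2.0, 5.0, 1.0],
              [0.0, 3.0, 4.0],
              [0.0, 0.0, 7.0]])

# The determinant is the product of the diagonal entries: 2 * 3 * 7 = 42.
print(np.prod(np.diag(T)))         # 42.0
print(round(np.linalg.det(T), 6))  # 42.0
```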

Given that you calculate the eigenvalues from the characteristic equation:

det(A - λ I) = 0

(where I is the identity matrix) then, for a triangular matrix, the eigenvalues are given by:

(a1,1 - λ) . (a2,2 - λ) . (a3,3 - λ) ... (an,n - λ)  = 0

The roots of this polynomial, the eigenvalues, are therefore exactly the diagonal elements. Since the diagonals of a triangular matrix representing an acyclic graph are all 0, all its eigenvalues are 0.

Conclusion

We've shown that just by looking at the adjacency matrix (and not tediously exploring the graph) we can tell whether the graph is cyclic or acyclic. This is done by calculating the eigenvalues, which is simple with any decent maths library.
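That test can be packaged into a few lines; is_acyclic is my name for it, and the tolerance is a guard against floating-point noise on larger matrices:

```python
import numpy as np

def is_acyclic(A, tol=1e-6):
    """A directed graph is acyclic iff all eigenvalues of its
    adjacency matrix are (numerically) zero."""
    eigenvalues = np.linalg.eigvals(np.asarray(A, dtype=float))
    return bool(np.all(np.abs(eigenvalues) < tol))

chain = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])  # 0 -> 1 -> 2
tri = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]])    # 0 -> 1 -> 2 -> 0

print(is_acyclic(chain))  # True
print(is_acyclic(tri))    # False
```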

## Friday, April 1, 2016

### The Big Data Landscape

Having praised the speed of GraphX, it now seems I may have to eat my words. When I tried to use its method for finding the Strongly Connected Components, it was incredibly slow. The method takes a parameter for how many iterations it is to run. An increasing number found more SCCs but with ever-increasing times (22s, 62s, 194s, 554s, 1529s, ... on my admittedly naff laptop). An off-the-shelf library performed the complete calculation in sub-second time.

It appears I'm not the only one to suffer:
"We conducted some tests for using Graphx to solve the connected-components problem and were disappointed.  On 8 nodes of 16GB each, we could not get above 100M edges.  On 8 nodes of 60GB each, we could not process 1bn edges.  RDD serialization would take excessive time and then we would get failures.  By contrast, we have a C++ algorithm that solves 1bn edges using memory+disk on a single 16GB node in about an hour.  I think that a very large cluster will do better, but we did not explore that."

Alternatives

So, over the Easter holidays, I had a quick look for what else is out there.

Giraph: "Apache Giraph is an iterative graph processing system built for high scalability. For example, it is currently used at Facebook to analyze the social graph formed by users and their connections."

It has an SCC calculator here.

Hama: "It provides not only pure BSP programming model but also vertex and neuron centric programming models, inspired by Google's Pregel and DistBelief."

Pregelix: "Pregelix is an open-source implementation of the bulk-synchronous vertex-oriented programming model (Pregel API) for large-scale graph analytics."

A comparison between their performance (and GraphX's) can be found here.

Incidentally, there is code to test GraphX's ability to find SCCs here and libraries for various graph analysis algorithms here (but sadly nothing that gives a fast SCC algorithm).

In-memory graph analysis libraries

My graph should (just about) fit into memory on a big, big box (more than 100gb of memory), so I looked at what was out there.

Jung: "is a software library that provides a common and extendible language for the modeling, analysis, and visualization of data that can be represented as a graph or network."

Signal/Collect: "The intuition behind the Signal/Collect programming model is that computations are executed on a graph, where the vertices are the computational units that interact by the means of signals that flow along the edges. This is similar to the actor programming model. All computations in the vertices are accomplished by collecting the incoming signals and then signaling the neighbors in the graph."

JGraphT: "JGraphT is a free Java graph library that provides mathematical graph-theory objects and algorithms."

Boost: This library is written in C++ but is open source and well established.

Parallelism?

The problem is that although my graph can just about fit into memory, processing it will be slow if the libraries are single-threaded. Only Boost offered a parallel algorithm for calculating SCCs.

My implementation using JGraphT was great for relatively small graphs, but when the number of vertices was roughly one million, I was looking at about 10 minutes to calculate SCCs. With 4 million vertices, it took an hour.
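For small graphs, networkx's built-in SCC finder makes a handy single-threaded Python baseline to check results against (a sketch with a hypothetical toy graph, not one of the JVM libraries above):

```python
import networkx as nx

# A hypothetical graph: a 3-cycle with a dangling tail vertex.
G = nx.DiGraph([(0, 1), (1, 2), (2, 0), (2, 3)])

# Tarjan/Kosaraju-style SCC computation: single-threaded and in-memory.
sccs = sorted(nx.strongly_connected_components(G), key=len, reverse=True)
print(sccs[0])    # {0, 1, 2}
print(len(sccs))  # 2: the cycle plus the singleton {3}
```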

"[T]here is currently no library for parallel graph algorithms in Java available" as of 2011 (see here). Parallel graph algorithms are well established (see here) but it appears that nobody has written such a library for the JVM. If they did, they'd have to use the most memory-efficient Scala and Java data structures to represent the graph if there were any hope of fitting large graphs into memory.