Friday, April 1, 2016

The Big Data Landscape


Having praised the speed of GraphX, I may now have to eat my words. When I tried to use its method for finding Strongly Connected Components (SCCs), it was incredibly slow. The method takes a parameter for how many iterations it should run. Increasing that number found more SCCs but with ever-increasing run times (22s, 62s, 194s, 554s, 1529s, ... on my admittedly naff laptop). An off-the-shelf library performed the complete calculation in sub-second time.
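For context, the call in question looks roughly like this. It's a minimal sketch on a toy graph (the local Spark setup, object name and edge data are illustrative, not my real job):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphXSccDemo extends App {
  val sc = new SparkContext(new SparkConf().setAppName("scc-demo").setMaster("local[*]"))

  // Toy directed graph: 1 -> 2 -> 3 -> 1 is a cycle (one SCC); 4 hangs off it.
  val edges = sc.parallelize(Seq(
    Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1), Edge(3L, 4L, 1)
  ))
  val graph = Graph.fromEdges(edges, defaultValue = 0)

  // The iterations parameter mentioned above: more iterations find more SCCs,
  // but the run time climbs steeply with each extra iteration.
  val scc = graph.stronglyConnectedComponents(numIter = 5)

  // Each vertex ends up labelled with the smallest vertex id in its component.
  scc.vertices.collect().sorted.foreach(println)

  sc.stop()
}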

It appears I'm not the only one to suffer:
"We conducted some tests for using Graphx to solve the connected-components problem and were disappointed.  On 8 nodes of 16GB each, we could not get above 100M edges.  On 8 nodes of 60GB each, we could not process 1bn edges.  RDD serialization would take excessive time and then we would get failures.  By contrast, we have a C++ algorithm that solves 1bn edges using memory+disk on a single 16GB node in about an hour.  I think that a very large cluster will do better, but we did not explore that."

Alternatives

So, over the Easter holidays, I had a quick look for what else is out there.

Giraph: "Apache Giraph is an iterative graph processing system built for high scalability. For example, it is currently used at Facebook to analyze the social graph formed by users and their connections."

It has an SCC calculator here.

Hama: "It provides not only pure BSP programming model but also vertex and neuron centric programming models, inspired by Google's Pregel and DistBelief."

Pregelix: "Pregelix is an open-source implementation of the bulk-synchronous vertex-oriented programming model (Pregel API) for large-scale graph analytics."

A comparison of their performance (and GraphX's) can be found here.

Incidentally, there is code to test GraphX's ability to find SCCs here and libraries for various graph analysis algorithms here (but sadly nothing that gives a fast SCC algorithm).


In-memory graph analysis libraries

My graph should (just about) fit into memory on a big, big box (more than 100GB of memory), so I looked at what was out there.

Jung: "is a software library that provides a common and extendible language for the modeling, analysis, and visualization of data that can be represented as a graph or network."

Signal/Collect: "The intuition behind the Signal/Collect programming model is that computations are executed on a graph, where the vertices are the computational units that interact by the means of signals that flow along the edges. This is similar to the actor programming model. All computations in the vertices are accomplished by collecting the incoming signals and then signaling the neighbors in the graph."

JGraphT: "JGraphT is a free Java graph library that provides mathematical graph-theory objects and algorithms."

Boost: This library is written in C++ but is open source and well established.


Parallelism?

The problem is that although my graph can just about fit into memory, processing it will be slow if the libraries are single-threaded. Only Boost offered a parallel algorithm for calculating SCCs.

My implementation using JGraphT was great for relatively small graphs, but when the number of vertices was roughly one million, I was looking at about 10 minutes to calculate the SCCs. With 4 million vertices, it took an hour.
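For what it's worth, the JGraphT code itself is tiny. A minimal sketch on a toy graph (note the package name is an assumption: recent JGraphT releases keep the inspector in org.jgrapht.alg.connectivity, while older versions have it under org.jgrapht.alg):

import org.jgrapht.alg.connectivity.KosarajuStrongConnectivityInspector
import org.jgrapht.graph.{DefaultDirectedGraph, DefaultEdge}

object JGraphTSccDemo extends App {
  val g = new DefaultDirectedGraph[String, DefaultEdge](classOf[DefaultEdge])

  // Toy directed graph: a -> b -> c -> a is a cycle; d hangs off it.
  Seq("a", "b", "c", "d").foreach(g.addVertex)
  g.addEdge("a", "b"); g.addEdge("b", "c"); g.addEdge("c", "a"); g.addEdge("c", "d")

  // Kosaraju's algorithm: exact, but single-threaded.
  val inspector = new KosarajuStrongConnectivityInspector(g)
  println(inspector.stronglyConnectedSets()) // two components: {a, b, c} and {d}
}

It gives the right answer; the problem is simply that it does all the work on one core.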

"[T]here is currently no library for parallel graph algorithms in Java available" as of 2011 (see here). They are well established (see here) but it appears that nobody has written one for the JVM. If they did, they'd have to take into account the most memory-efficient Scala and Java data structures to represent the graph if there was a hope of fitting large ones into memory.
