Spark supports neural nets, but the variety is quite limited. For instance, somebody asked on the Spark mailing list whether Recurrent Neural Networks would soon be implemented, to which the answer was: "The short answer is there is none and highly unlikely to be inside of Spark MLlib any time in the near future."
Deep learning in YARN
Since we're looking at getting serious with neural nets, we were forced to look at alternatives to Spark's own offering. The one we will try first comes from Yahoo!: TensorFlowOnSpark, which promises to run TensorFlow within YARN. This also suits our hardware, as it looks as if we may be able to exploit Remote Direct Memory Access (RDMA) - reading another machine's memory directly, without going via the (slow) OS. "Infiniband provides faster connectivity and supports direct access to other servers’ memories over RDMA. Current TensorFlow releases, however, only support distributed learning using gRPC over Ethernet. To speed up distributed learning, we have enhanced the TensorFlow C++ layer to enable RDMA over Infiniband", the Yahoo! team claim.
Another alternative is simply to deploy a library dedicated to more exotic types of neural nets, and DeepLearning4J looks ideal for that. "Data parallelism shards large datasets and hands those pieces to separate neural networks, say, each on its own core. Deeplearning4j relies on Spark for this, training models in parallel and iteratively averages the parameters they produce in a central model. (Model parallelism ... allows models to specialize on separate patches of a large dataset without averaging.)"
(From "Large Scale Distributed Deep Networks", Dean et al)
Now, regarding GPUs, note what the author of netlib-java (the Java binding to native BLAS) says: "Be aware that GPU implementations have severe performance degradation for small arrays."
Pure Spark implementation
So, while we prepare to install all these libraries, I had time to play with Spark's own implementation of neural networks on the public-domain 20 Newsgroups data I spoke about before. There were some interesting findings.
The first was that a lot of the work is done on the driver. This seems to happen in quite a bit of the older Spark code (something I've complained about before). Although there is parallelism in o.a.s.m.o.CostFun, where the gradients of the partitioned features in an RDD are computed on the executors and summed together, other calculations appear to be done locally on the driver. For instance, the update using this potentially very large result from the executors is done in o.a.s.m.ann.ANNUpdater, and the L-BFGS calculation appears to take place entirely in Breeze (see breeze.optimize.FirstOrderMinimizer.infiniteIterations).
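For reference, the split described above looks roughly like this. It is a simplified sketch with hypothetical names, not Spark's actual CostFun/ANNUpdater code: the gradient is summed across partitions on the executors, but the full-size weight update happens on the driver.

```scala
import org.apache.spark.rdd.RDD
import breeze.linalg.DenseVector

// One optimisation step: gradients are computed and summed in parallel via
// treeAggregate, then the (potentially huge) weight vector is updated
// locally on the driver.
def step(
    data: RDD[DenseVector[Double]],
    weights: DenseVector[Double],
    gradientAt: (DenseVector[Double], DenseVector[Double]) => DenseVector[Double],
    stepSize: Double
): DenseVector[Double] = {
  val bcWeights = data.sparkContext.broadcast(weights)
  val gradient = data.treeAggregate(DenseVector.zeros[Double](weights.length))(
    seqOp = (acc, x) => acc + gradientAt(bcWeights.value, x), // executors: per-record gradients
    combOp = (a, b) => a + b                                  // executors/driver: tree-wise sum
  )
  weights - gradient * stepSize // driver: local update of the full weight vector
}
```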
Consequently, I was forced to give the driver 30GB of memory and to set spark.driver.maxResultSize to 8192. Only then could I train a MultilayerPerceptronClassifier with layer sizes [262144, 100, 80, 20]. Note that the outer sizes are dictated by the data: my TF-IDF word vectors have 262144 elements (2^18, the default for Spark ML's HashingTF) and there are only 20 different newsgroups. The 100 and 80 were guessed at by slowly increasing them until I hit memory problems.
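The training run itself is only a few lines of Spark ML. In this sketch the train/test DataFrames (with the usual "features" and "label" columns), the iteration count and the seed are my assumptions; the driver settings above went on the spark-submit command line (e.g. --driver-memory 30g --conf spark.driver.maxResultSize=8192).

```scala
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.sql.DataFrame

// Train the [262144, 100, 80, 20] network and score the test set.
def fitMlp(train: DataFrame, test: DataFrame): DataFrame = {
  val trainer = new MultilayerPerceptronClassifier()
    .setLayers(Array(262144, 100, 80, 20)) // input = TF-IDF width, output = 20 newsgroups
    .setMaxIter(100)                       // iteration budget is a guess
    .setSeed(1234L)
  trainer.fit(train).transform(test)       // returns the test set with a "prediction" column
}
```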
Results
Now, with this neural net architecture I was able to achieve about 81% accuracy in roughly 30 minutes (an architecture of [262144, 50, 40, 20] was much faster and didn't require giving the driver more memory, but its accuracy was only about 77%).
But this does not compare favourably with avoiding neural nets altogether. A naive Bayes classifier took no time at all, didn't require much memory and gave me about 84% accuracy. Simplest thing that works, eh?
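The baseline really is that simple. A minimal sketch, assuming the same "features"/"label" DataFrames as the MLP sketch above:

```scala
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.DataFrame

// Fit naive Bayes and report multiclass accuracy on the test set.
def naiveBayesAccuracy(train: DataFrame, test: DataFrame): Double = {
  val model = new NaiveBayes().fit(train) // multinomial by default, fine for non-negative TF-IDF vectors
  new MulticlassClassificationEvaluator()
    .setMetricName("accuracy")
    .evaluate(model.transform(test))
}
```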