Thursday, August 20, 2015

Spark Shuffle

... is something you should avoid.  "All shuffle data must be written to disk and then transferred over the network" (from here). This can be very expensive on a large data set. If the data is evenly distributed but a query asks for it to be grouped according to an arbitrary criteria, much movement will ensue. Therefore, groupByKey should be avoided if possible as the ScalaDocs indicate this can be an expensive operation.

However, reduceByKey is much more efficient as it will "perform the merging locally on each mapper before sending results to a reducer" resulting in a much smaller transfer of data between nodes.

So, when we were writing a risk engine with Spark holding all our user-defined Record objects, our code would look something like:

def totalByKey(data: RDD[Record]): RDD[(String, Double)] = { => record.key -> record.value).reduceByKey(_ + _)

which would first turn our Records into key/value pairs then add up all the values for each key without transferring large amounts of data over the network. It avoids this by adding all the values for Records on each node before adding all these subtotals. Only the addition of subtotals requires a network call.

Interestingly, Spark can then do an outer join like this:

val subtotalForDate1 = totalByKey(dataForDate1)
val subtotalForDate2 = totalByKey(dataForDate2)
val outerJoing       = subtotalForDate1.fullOuterJoin(subtotalForDate2)

which returns an RDD of tuples containing the key, the left and the right value. Using this join, we can compare our risk profile from one date to the next.

Note that both reduceByKey and fullOuterJoin don't actually live in the RDD class but the PairRDDFunctions class. They appear to be part of the RDD class by the magic of implicits when the RDD in question contains tuples of pairs.

One last thing to remember is that if you're trying to be clever and reduce collections of elements, you're no better off than using groupByKey. The function reduceByKey is only useful if you are reducing something like a number. If you look at the code for groupByKey you'll see that it's basically doing just that: reducing collections.

Further Reading

Nice article on memory usage.

No comments:

Post a Comment