Saturday, March 30, 2019

FP and Big Data part 2


Functional Programming gives guarantees that Big Data tools can leverage. For example, when an aggregation such as SUM is "expressible as an associative and commutative operator (who said “monoid”?), it can be executed more efficiently, e.g. the framework can push some aggregation into the first map. In Hadoop, such an aggregation is called a combiner." (Eugene Kirpichov at Medium.com)
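A minimal sketch of the combiner idea in plain Scala (no Hadoop or Spark; the partition size of 3 is an arbitrary choice for illustration). Because Int addition is associative and commutative, per-partition partial sums can be merged in any order and still yield the total:

```scala
val data = Seq(3, 1, 4, 1, 5, 9, 2, 6)

// "Map side": pre-aggregate each partition (the combiner step).
val partials = data.grouped(3).map(_.sum).toSeq

// "Reduce side": merging the partials in any order gives the same total,
// because Int addition is associative and commutative.
assert(partials.sum == data.sum)
assert(partials.reverse.sum == data.sum)
```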

Note that Spark's reduce (and indeed Scala's too; see StackOverflow) requires a commutative semigroup when, ideally, a plain semigroup would suffice. "The class of commutative semigroups consists of all those semigroups in which the binary operation satisfies the commutativity property that ab = ba for all elements a and b in the semigroup." (Wikipedia)

[In Scala, "reduceLeft and foldLeft apply the operator from left to right (head to tail). reduceRight and foldRight apply it from right to left. As for reduce and fold, the order is unspecified, which is what allows to have parallel implementations for it (see ParIterableLike). As a consequence, you'd better be sure that your operator is associative [and commutative - think String concatenation] when calling reduce, otherwise the result will not be deterministic." (StackOverflow)]

Commutativity clearly does not hold for some obvious semigroups, for instance string concatenation (in fact a monoid, with the empty string as its identity element). And these algebraic laws are not necessarily upheld by Java primitives either.
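A quick check in plain Scala makes the gap concrete: string concatenation is associative (so left and right reductions agree) but not commutative, so a framework that assumes commutativity is free to scramble the order:

```scala
val words = Seq("foo", "bar", "baz")

// Associative: the grouping doesn't matter...
assert((("foo" + "bar") + "baz") == ("foo" + ("bar" + "baz")))

// ...but not commutative: the order very much does.
assert(("foo" + "bar") != ("bar" + "foo"))

// Thanks to associativity alone, left and right reductions agree:
assert(words.reduceLeft(_ + _) == words.reduceRight(_ + _))
```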

Haskell appears not to have an issue with this floating point fold (note, though, that these literals default to Double):

$ ghci
GHCi, version 7.10.3: http://www.haskell.org/ghc/  :? for help
Prelude> foldr (\x y -> x + y) 0 [1.0, 0.05, 0.05]
1.1
Prelude> foldl (\x y -> x + y) 0 [1.0, 0.05, 0.05]
1.1

Whereas Scala, summing 32-bit Floats (IEEE 754 single precision), gives this:

    val floats = Seq(1.0f, 0.05f, 0.05f)
    println(floats.sum)                 // 1.0999999
    println(floats.reduce(_ + _))       // 1.0999999
    println(floats.reduceLeft(_ + _))   // 1.0999999
    println(floats.reduceRight(_ + _))  // 1.1
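
The discrepancy is non-associativity at work: reduceLeft groups from the left, reduceRight groups from the right, and with Floats the two groupings round differently:

```scala
// reduceLeft's grouping rounds down to 1.0999999f;
// reduceRight's grouping (0.05f + 0.05f is exact) rounds to 1.1f.
val left  = (1.0f + 0.05f) + 0.05f
val right = 1.0f + (0.05f + 0.05f)
assert(left != right)  // Float addition is not associative
```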

And it's not just floating point arithmetic that has this problem. Exceed the maximum value of an Int or a Long and the addition silently wraps around, so the sum is simply wrong.
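A quick demonstration of the wraparound, plus one common mitigation (widening to Long before summing). Note that the failure mode differs from the Float case: wrapped two's-complement addition is still associative and commutative, so reduce stays deterministic; it is just arithmetically wrong:

```scala
// Two's-complement Int addition silently wraps past Int.MaxValue:
assert(Int.MaxValue + 1 == Int.MinValue)

val big = Seq(Int.MaxValue, Int.MaxValue)
assert(big.sum == -2)  // wrapped: arithmetically wrong

// Widening to Long first gives the true sum (for Int-sized inputs):
assert(big.map(_.toLong).sum == 2L * Int.MaxValue)
```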

So be careful: the algebraic laws that FP reasoning relies on are not enforced by the JVM.
