Agile Java Man: FP and Big Data

Distributed computing relies surprisingly heavily on mathematics.

Associativity

This just says that

A . (B . C) = (A . B) . C

for any operator '.'

Note that addition is obviously associative when subtraction isn't. However, there is a clever hack around this.

"Subtraction is not an associative operation. But it is associative under addition, where subtraction is defined to be the inverse operation, ⁻¹ :N→N:n↦−n together with addition. This way, I can do subtraction, via addition of inverses, where 3 - 5 is desugared to 3 + 5⁻¹.

"Subtraction as a standalone function is not associative, because 3−(−5−2)≠(3−(−5))−2. But addition is, in fact, principled, associative, commutative, preserves identity, respects inverses, and is semantically what you want.

"We care about associativity ... in the context of distributed programming ... because it means we can arbitrarily partition data. We care about identity because it means we know that recursion will terminate when we reach the identity element (it is a fixed point modulo our operation), and we know that if a function is also commutative, then we can arbitrarily combine our results.

"This is the model that Spark uses underneath the hood - you'll notice it only takes commutative monoidal functions for its jobs (i.e. sets of data with an operation that has an identity within the dataset, and is associative + commutative). The algebra of distributed programming is quite simple."

[Emily Pillmore, Gitter]

"In this context this means that if we have pipelines A, B and C, and we wish to join them to form a single pipeline ABC, we can either join A and B and then join the result AB and C, or we could join A to pipeline BC. " [Stephen Zoio's blog]

Commutivity

"Suppose you wrote this:

someNumbers.reduce(_ - _)

What would you expect the result to be? The short answer is: we don’t know. Since the operation is not associative and is not commutative, we have broken both constraints we need to have this operation work well. " [Nicholos Tietz-Sokolosky's blog]

Domain

"Unless the set A [to which binary operations are applied] is chosen carefully, they [binary operations] may not always be defined. For example, if one restricts one's attention to the positive integers, then the expression 3-5 has no meaning. There are two conventions one could imagine adopting to this. One might decide not to insist that a binary operations should be defined for every pair of elements of A, and to regard it as a desirable extra property of an operation if it is defined everywhere. But the convention actually in force is that binary operations do have to be defined everywhere, so that "minus," though a perfectly good binary operation on the set of all integers, is not a binary operation on the set of all positive integers." [Princeton Companion to Mathematics]

Similarly, a bad choice of domain can blow up in something like Spark. If you're adding Ints, their total will not exceed the maximum value, right?

Composability

"Whenever one is looking for a general solution to composability, monads are normally not too far away. By composability, we mean the output of one process is the input into another. And that is precisely what we are trying to do. A data processing pipeline consists of several operations, each joined together to form the pipeline. And we can join two or more pipelines to form larger pipelines."
[Stephen Zoio]

Zoio gives an example where wrapping calls to Spark in Monads makes things more re-usable but although the article is otherwise excellent, I found this unconvincing as each Spark call was tied to another. Instead, I refactored his Scalaz/Monad code to make it a little more divisible here.

Equality

Rob Norris in his "Gazing at the Stars" lecture makes some interesting points. He describes how a simple type alias of type Angle = Double caused thousands of dollars of damage as it was not sufficient for all comparisons. (The lecture is largely about mapping between different units).

At 11'34" into the video, Norris tells us Scala 2 lets you compare anything for equality, for example a URL and a Classloader. That's a bug that should be fixed in Scala 3. FP programmers prefer type classes as a solution for example Eq (pronounced "eek").

Note that Norris uses CatsSuite for testing. His code for the lecture can be found here, here and here. Norris' use of CatsSuite throws a lot of random numbers at the unit tests to try and flush out errors.

Monoids

The mathematical definition of Monoids and some Scalaz examples can be found in a previous post of mine here.

Note that even Haskell doesn't (can't?) enforce the Monoid rules. "There must be a value that acts as the identity with respect to the binary function and that the binary function must be associative. It's possible to make instances of Monoid that don't follow these rules... Haskell doesn't enforce these laws so we need to be careful that our instances do indeed obey them." [LYAHFGG]

The best way I've seen so far to address this issue is by the types giving a hint. The function "reduceLeft is not parallelizable, as it explicitly assumes a form that is not associative: (B,A) => B. As long as you use an associative operator, reduce is parallelizable." [SO] Whereas, (B,B) => B would imply associativity.

You need a monoid for fold operations to work properly and a commutative semigroup for reduce. "Commutative semigroups consists of all those semigroups in which the binary operation satisfies the commutativity property that ab = ba for all elements a and b in the semigroup." [Wikipedia]. Monoids are a specialization of semigroups that have an identity element.

Optimization

Obeying the FP laws allows Spark etc to optimize the code on our behalf.

"We can do this because we know our programs are invariant under certain substitutions. Referential transparency gives us the most basic rule: we can inline anything, and factor anything out. Lawful abstractions give us more substitution rules for our toolbox. Functor tells us that fa.map(f).map(g) = fa.map(f andThen g) for instance, so we get an optimization opportunity that we can apply freely in our programs." [tpolecat Jan 09 17:33, Gitter]

However, as with all performance testing, it's an empirical science. Using the almost canonical example of an AST for basic arithmetic, Noel Walsh ("Unite Church and State") says: "each constructor in the FP implementation becomes a method in the OO implementation. This removes the need for the pattern matching. The transformation is known as Church encoding.

"We can go the opposite way as well: convert every method in the OO representation into a case class in a sealed trait (an algebraic data type), and then use a pattern match to chose the action to take (a structural recursion). This transformation is known as reification.

"We can think of OO and FP not as entirely different programming paradigms but choices we make to encode a solution to the problem according to the tradeoffs we want to make. Let’s see an example where where this transformation is useful. One aspect ... is performance. In the FP representation we must allocate memory to hold the data structure that represents the operations we want to perform."

Frameless

The Frameless project brings a more FP approach to using Dataframes. Now, one can already use the more strongly-typed Datasets to give you compile-time error checking but "unfortunately, this syntax does not allow Spark to optimize the code." Frameless apparently gets around that problem will still giving type safety.

However, I did change a field name but foolishly did not change it elsewhere in the codebase where I called partitionBy. I asked on their Gitter channel how Frameless could save me from this but was told:

Cody Allen @ceedubs Jan 12 19:19
@PhillHenry as far as I can tell, Frameless doesn't specifically have support for that partitionBy method. But in general that is the sort of problem that frameless solves.

So, maybe a future pull request...?

Agile Java Man

Tuesday, January 15, 2019

FP and Big Data

No comments:

Post a Comment

Blog Archive

About Me