Agile Java Man: April 2020

Monday, April 20, 2020

Graph databases and Data Collisions

Back in 2016, we were comparing Neo4J and OrientDB graph databases. I was tasked with exploring OrientDB and engaged the CEO, Luca Garulli, to help me bulk load a huge amount of data (our chat can be found here).

In both cases, there is a problem with inserting edges in that each edge insertion can update two vertices. Let's assume there are no edges connecting the same vertex to itself (there really isn't in our data), so each insertion updates exactly two vertices. Well, if two edges are inserted at the same time there is the possibility of a collision. This is a version of the famous Birthday Problem.

We were advised by the OrientDB guys to batch our edge insertions. But what's a good batch size? A high number would mean greater throughput but also a higher chance of collision that would ultimately force us to retry.

Let's say there are N vertices that have already been written to the DB and our batch size for writing edges is b. What are the chances of a collision for a given batch of edges?

Firstly, we need to find the expected number of collisions within a batch of size b. These are OK as they won't force us to retry since they are part of the same transaction.

Here, we note a nice trick outlined here.

The expectation of a sum is the sum of the expectations. This theorem holds "always." The random variables you are summing need not be independent.

So, the expected number of collisions, E(X) is

E(X) = E(collisions for first vertex) + E(collisions for second vertex) + ...

The expected value of anything is the sum of all possible values multiplied by their probability. So, the expected number of collisions for each vertex is

E(collision of vertex x) = 0 * p(no collision) + 1 * p(collision) = p(collision)

Note that the expected value need not be an integer nor indeed sensible. The expected number when throwing a 6-sided die is 3.5.

Given n vertices, this number becomes:

E(X) = bp

where the probability is given in the link above, that is

p(collision) = 1 - (1 - (1/N))^b-1

For anything reasonable, this is pretty small. For instance, if there are 1 million vertices and we choose a batch size of 1000, the expected number of collisions within a batch is just less than 1.

Monday, April 13, 2020

Applicative and Effectful Types

Typical use case: we have several legacy java.io.OutputStreams to close after we have done some work. We want to try to close all of them even if some might fail to close. Can we encapsulate these individual units of work in monads?

Recall a key distinction between the type classes Applicative and Monad - Applicative captures the idea of independent computations, whereas Monad captures that of dependent computations. Put differently Applicatives cannot branch based on the value of an existing/prior computation. Therefore when using Applicatives, we must hand in all our data in one go. [Cats documentation]

So, not only is monad inappropriate, it's impossible to perform this use case with monads at all.

Ok, let's look more closely at these Applicatives:

Rob Norris @tpolecat There's already a common misconception that applicative composition implies order-independence, which isn't necessarily (or even commonly) true. I feel like we're still kind of struggling for a way to talk about effects. I am, anyway.
Fabio Labella @SystemFw Because most education resources state that Applicative does represent non ordered computation whereas it can represents non ordered computation.
Rob Norris @tpolecat Even Validated is order-dependent

[From Gitter]

My reading of this is that calling Apply's *> will "Compose two actions, discarding any value produced by the first." (Cats Apply Scaladoc). The order is important in that all but the last result is thrown away.

To demonstrate this, I wrote some code here [GitHub] that uses Validated from Cats which is an Applicative as its code [GitHub] proves here.

My code uses a filthy var but as even the esteemed Fabio Labella "you need a Ref, but not a Ref of aWriter, just a Ref with IO. You can also use a var, whether you use one or another often depends on how complex the test is (var works well for simple cases with no concurrency)". So, with that caveat, let's proceed.

var i = 0
def failure(): Validated[String, String] = {
println("failure")
i = i + 1
Invalid("failure message")
}
def success(): Validated[String, String] = {
println("valid")
i = i + 1
Valid("success message")
}

success() *> failure() *> success()
i shouldBe 3

This passes. Now, let's look at a monadic version in the same test class:

type MonadType = Either[String, Int]
val aye: MonadType = Right(1)
val nay: MonadType = Left("nope")

aye *> nay *> aye shouldBe nay

You can demonstrate if you like that the last aye is not actually called but this is left as an exercise.

ZIO

ZIO is another Effects system. It is totally independent of Cats (unlike Monix, yet another Effects system, which depends on Cats).

"A quick summary: efforts to use Pure Functional Programming in Scala began with Scalaz lib which was then continued by Cats. These libraries are great, but in essence they try to replicate the whole Haskell experience in Scala. This comes at a cost mainly because Haskell is a non-strict, lazy-by-default language while the JVM is both eager and strict, so in order to make some things work some techniques were used that affect your performance and Scala's ability to infer types.
ZIO strives to give you a PFP experience but taking adavantage of Scala's paradigms (such as variance) to help both with performance and type inference.
it also tries to be more friendly for newcomers" ToxicaFunk (12 April 2020) on Discourse

Now, although all monads are applicatives, not all applicatives are monads (cats.data.Validated being an example of such a type). "We sometimes use the terms monadic effects or applicative effects to mean types with an associated Monad or Applicative instance." [Functional Programming in Scala]

In ZIO, there is "one monad to rule them all", also called ZIO. So, how do you get Applicative behaviour? None other than ZIO's creator himself answered me on Discourse:

jdegoes 12 April 2020 at 2:10 PM
@PhillHenry x1.ignore &> x2.ignore &> x3. This will execute x1, x2, and x3 in parallel, using zipRightPar (that's the &> operator, zips two effects together in parallel, returning whatever is produced on the right), and ignore the result of x1 and x2 so their failures don't influence the result of the computation.
@PhillHenry In ZIO, even parallel zip or collect or foreach operations will "kill" the other running effects if one of them fails. Because that's often what you want. To get the other behavior, just use .ignore in the right places to ignore the failures you don't care about.

My equivalent ZIO code can be found here on GitHub. Looking at the thread names and stack traces, these operations do appear to execute on different thread/fibres. This is not the case on the Cats implementation.

Wednesday, April 8, 2020

Mathematical Proofs for Distributed Systems

Software is becoming more about proofs. With well written FP code, you can prove at compile time that your software cannot enter certain pathological states.

Well, the world of architecture is moving in that direction, too.

Introducing TLA+

TLA+ is an attempt to help architects and developers to write bullet-proof systems. It's particularly useful in the domain of distributed computing.

"Distributed algorithms and protocols are hard to get right, especially, when they have to tolerate faults. One of the reasons is that distributed algorithms vary in their assumptions about thedistributed system: the communication medium, system synchrony, possible faults, etc. ... [TLA] offers a rich syntaxfor sets, functions, tuples, records, and sequences on top of first-order logic"

TLA+ Model Checking Made Symbolic

It's already being used in architecting Kafka.

Propositional Logic

In propositional logic, "there are only two values, TRUE and FALSE. To learn how to compute with these values, all you need to know are the ... definitions of the five Boolean operators"[1]: ∧, ∨, ¬, ≡ and ⇒.

The first four are AND, OR, NOT and EQUALS, so nothing special there. The last one (the implies operator) is tricksy. Its truth table looks like this:

F	G	F⇒G
TRUE	TRUE	TRUE
FALSE	TRUE	TRUE
FALSE	FALSE	TRUE
TRUE	FALSE	FALSE

To help understand it, let's look at this example formula:

(n > 3) ⇒ (n > 1)

The first three rows of the truth table simply describe n>3, 1<n≤3 and n≤1.

The last row is when n>3 but not n>1 which is clearly wrong, hence FALSE.

Now, say you were asked to prove that:

(F⇒G) ≡ (¬F∨G)

is a tautology. You could write out all the truth tables which would be laborious. "However, computers are better at doing this sort of calculation."[1]

Very nice, but what's the application?

You can find a very nice model of some Kafka functionality here on GitHub. You can open this in the IDE called TLA+ Toolbox. It looks like this:

The language is pretty straight forward. For instance, pre-pending formulas with operators like /\ is syntactic sugar for removing wide parenthesis and EXTENDS seems to be like import in Java and Scala. You can find some blogs to learn more about it here [LearnTLA], here [Anton Sookocheff's blog] and here [Jack Van Lightly's blog].

The Toolbox can turn this into a more readable PDF (more readable for mathematicians anyway):

You can then "run" your model by giving it initial parameters. The Toolbox will then explore its state space. Interestingly, Hillel Wayne in this video descibes the path through the states as a description of the behavious of a system.

Another video can be found here and the transcript to Lamport's own presentation can be found here.

Conclusion

TLA+ looks like it could turn the IT architecture industry into something more than "it feels kinda right" that we see at the moment. One only hopes that its adoption in projects like Kafka will encourage other people to learn it. Me - I've only taken the first few steps.

[1] Specifying Systems, Dr Leslie Lamport

Agile Java Man