Wednesday, October 11, 2017

HBase, Cassandra and CAP


Which is best - Cassandra or HBase? Well, it depends on what you're trying to do.

The oft-quoted comparisons generally involve Eric Brewer's CAP theorem (the formal proof by Gilbert and Lynch can be found here and an updated review here) and you can see pictures like this dotted around the web:

From "Cassandra: the Definitive Guide" quoted here
(where the RDMBS offerings imply Two-Phased Commit).

Most developers have some awareness of CAP but there are a lot of misconceptions - typically that you can have at most two elements of Consistency, Availability or Partition tolerance - but not all three. "The 2 of 3 formulation was always misleading because it tended to oversimplify the tensions among properties" (from an article by Brewer himself found here).
Allowing at least one node to update state will cause the nodes to become inconsistent, thus forfeiting C. Likewise, if the choice is to preserve consistency, one side of the partition must act as if it is unavailable, thus forfeiting A. Only when nodes communicate is it possible to preserve both consistency and availability, thereby forfeiting P. The general belief is that for wide-area systems, designers cannot forfeit P and therefore have a difficult choice between C and A.
Here are some common misconceptions corrected.

  • The C in CAP "has got nothing to do with the C in ACID, even though that C also stands for 'consistency'" [1]. "Gilbert and Lynch use the word “atomic” instead of consistent in their proof, which makes more sense" [2]. The C in ACID has several similar but not mutually exclusive meanings, which include not violating any database constraints, linearizability (see below) and ensuring the data is consistent (eg, domain-specific rules such as: taking money out of one bank account and putting it into another means the total amount of money does not change).
  • "The CAP model says nothing about transactions that touch multiple objects. They are simply out of scope for the theorem" [2].

In "Please stop calling databases CP or AP", Martin Kleppmann says:

"But if you’re using some other notion of consistency or availability, you can’t expect the CAP theorem to still apply. Of course, that doesn’t mean you can suddenly do impossible things, just by redefining some words! It just means that you can’t turn to the CAP theorem for guidance, and you cannot use the CAP theorem to justify your point of view.

"If the CAP theorem doesn’t apply, that means you have to think through the trade-offs yourself. You can reason about consistency and availability using your own definitions of those words, and you’re welcome to prove your own theorem. But please don’t call it CAP theorem, because that name is already taken."

In the CAP theorem:
  • Consistent means linearizability. Think of visibility in terms of variables in the JVM: "If operation B started after operation A successfully completed, then operation B must see the system in the same state as it was on completion of operation A, or a newer state." [1] A master that asynchronously replicates to slaves is an example of a non-linearizable system (there's a toy sketch of this after this list). Note: some systems (eg a DB that uses MVCC) are intentionally non-linearizable.

  • Availability means “every request received by a non-failing [database] node in the system must result in a [non-error] response” [from the proof]. Note: the response can take an arbitrary amount of time, which might violate some people's notion of availability...

  • "Partition Tolerance (terribly mis-named) basically means that you’re communicating over an asynchronous network that may delay or drop messages." Which is basically the internet. "

    "In practical terms, a distributed system cannot be made immune to network failures... If the system chooses to favor Availability (such systems are designated AP systems), then it will continue to service requests, even though the data could be in an inconsistent state... If the system chooses to favor Consistency (known as a CP system), then it will choose to stop serving requests (become unavailable) if the consistency of data cannot be guaranteed. For a CP system, this is achieved by requiring a certain number of nodes to confirm that a data update has been made before acknowledging the update." [Hazelcast]
Using the definitions above, Cassandra is not AP for quorum reads/writes. If there is a partition, the nodes on the minority side of the split cannot assemble a quorum and therefore are not available. (Interestingly, Brewer notes that "as some researchers correctly point out, exactly what it means to forfeit P is unclear.")
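As an illustration of quorum reads choosing consistency over availability, here is a minimal sketch using the (2017-era) DataStax Java driver from Scala; the contact point, keyspace and table are invented for the example:

import com.datastax.driver.core.{Cluster, ConsistencyLevel, SimpleStatement}

object QuorumRead extends App {
  val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
  val session = cluster.connect("demo")  // hypothetical keyspace

  // With a replication factor of 3, QUORUM needs 2 of the 3 replicas to
  // respond. On the minority side of a network partition this read throws
  // an UnavailableException (unavailable but consistent) rather than
  // returning possibly stale data.
  val read = new SimpleStatement("SELECT balance FROM accounts WHERE id = 42")
  read.setConsistencyLevel(ConsistencyLevel.QUORUM)
  println(session.execute(read).one())
}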

So, although Cassandra and HBase both have their pros and cons, the CAP theorem is largely orthogonal to the choice between them.

(Aside: although Elasticsearch is a search engine, not a database, there is an interesting chat about how CAP applies to it here.)

A solution?

There's a cheekily titled blog post, "How to beat the CAP theorem", in which the author concedes: "You can't avoid the CAP theorem, but you can isolate its complexity and prevent it from sabotaging your ability to reason about your systems". This isolation is via immutability.

In this post, author Nathan Marz talks of DBs "choosing availability over consistency":
The best consistency guarantee these systems can provide is eventual consistency. If you use an eventually consistent database, then sometimes you'll read a different result than you just wrote. Sometimes multiple readers reading the same key at the same time will get different results... It is up to you to repair the value once you detect that the values have diverged. This requires tracing back the history using vector clocks and merging the updates together (called read repair)
Marz's proposal to use immutable data leads him to this conclusion:
If you choose consistency over availability, then not much changes from before. Sometimes you won't be able to read or write data because you traded off availability... Things get much more interesting when you choose availability over consistency. In this case, the system is eventually consistent without any of the complexities of eventual consistency. Since the system is highly available, you can always write new data and compute queries. In failure scenarios, queries will return results that don't incorporate previously written data. Eventually that data will be consistent and queries will incorporate that data into their computations.
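Here's a minimal Scala sketch of that idea (types and numbers made up): state is never updated in place; immutable, timestamped facts accumulate, and a query is just a function over whatever facts have arrived so far. A fact stranded by a partition simply turns up late and is folded in, with nothing to reconcile:

object ImmutableFacts extends App {
  case class Fact(account: String, delta: Long, timestampMillis: Long)

  // Append-only log: facts are only ever added, never updated or deleted.
  var log = Vector.empty[Fact]
  def record(fact: Fact): Unit = log :+= fact

  // A query recomputes from the raw facts. During a partition it still
  // answers (availability), just without the not-yet-delivered facts.
  def balance(account: String): Long =
    log.filter(_.account == account).map(_.delta).sum

  record(Fact("alice", 100, 1L))
  println(balance("alice"))       // 100 while partitioned from the fact below
  record(Fact("alice", -30, 2L))  // a delayed fact arrives once the partition heals
  println(balance("alice"))       // 70: the query now incorporates it
}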
This sidesteps Brewer's proposal, which revolves around reconciliation during a partition recovery phase. Marz proposes no reconciliation, just a deferred sharing of the data.

[1] Martin Kleppmann's blog.
[2] Julian Browne's blog.

Sunday, October 8, 2017

Gradients Refresher


There is a lot of talk of gradients in Spark's machine learning library. Here are some notes to remind you what it's all about.

Approximation techniques

Newton's method uses differentials; secant methods use finite differences.

"For problems with more than a few variables, it also quickly became clear that the cost of evaluation Jacobians or Hessians [see below] at every step could be exorbitant. Faster methods were needed that might make use of inexact Jacobians or Hessians and/or inexact solutions of the associated linear equations, while still achieving superlinear convergence. An early breakthrough of this kind was the discovery of  Quasi-Newtonian methods in the 1960s by Broyden, Davidon, Fletcher and Powell, in which partial information is used to generate steadily improving estimates of the true Jacobian or Hessian or its matrix factors." [1]

Under certain circumstances "the convergence is quadratic, meaning that the error at each stage is roughly the square of the error at the previous stage" [1].
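To see both ideas at once, here is a minimal Scala sketch (the example function is my own, not from [1]): Newton's method uses the analytic derivative, the secant method replaces it with a finite difference, and printing Newton's error at each step shows it roughly squaring, i.e. quadratic convergence:

object RootFinding extends App {
  // Find the positive root of f(x) = x² - 2, i.e. √2.
  def f(x: Double)  = x * x - 2
  def fp(x: Double) = 2 * x   // the analytic derivative (the 'differential')

  // Newton step: uses f'(x) directly.
  def newton(x: Double): Double = x - f(x) / fp(x)

  // Secant step: approximates f' by a finite difference over two points.
  def secant(x0: Double, x1: Double): Double =
    x1 - f(x1) * (x1 - x0) / (f(x1) - f(x0))

  var x = 2.0
  for (_ <- 1 to 5) {
    x = newton(x)
    println(f"Newton error = ${math.abs(x - math.sqrt(2))}%.2e") // ~squares each step
  }
}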

Taylor Expansion

A sufficiently smooth function can be expanded about a point a as:

f(x) = f(a) + (x - a) f'(a) + (x - a)² f''(a) / 2! + ... + (x - a)ⁿ f⁽ⁿ⁾(a) / n! + ...

where f' is the first differential, f'' the second and f⁽ⁿ⁾ the n-th (see Boas [2] p25 for a proof).

The multivariate version (that is, when x and a are vectors) looks like this:

f(x) = f(a) + (x₁ - a₁) ∂f(a)/∂x₁ + (x₂ - a₂) ∂f(a)/∂x₂ + [(x₁ - a₁)² ∂²f(a)/∂x₁² + 2 (x₁ - a₁)(x₂ - a₂) ∂²f(a)/∂x₁∂x₂ + (x₂ - a₂)² ∂²f(a)/∂x₂²] / 2! + ...

in the case of 2 dimensions (that is, when x = (x₁, x₂) and a = (a₁, a₂)), although it could be any number of dimensions.

For a polynomial of degree 2, the first three terms are all you need as any further differentiation will always yield zero. Alternatively, we might choose that the first three terms are "good enough". Either way, let's now drop all the other terms. We also note that the above equation, when we drop the extraneous terms, can be more succinctly expressed as matrices, thus:

f(x) = f(a) + pᵀ ∇f(a) + pᵀ B p / 2

where p = x - a and B is the Hessian (see below). By putting some simple values into a square matrix, it's not hard to see that this last term in the matrix equation when expanded is the same as the more verbose equation above it.
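Concretely, in the 2-dimensional case, writing p = (p₁, p₂)ᵀ and using the symmetry of B (in LaTeX, my own expansion):

p^T B p
  = \begin{pmatrix} p_1 & p_2 \end{pmatrix}
    \begin{pmatrix}
      \partial^2 f / \partial x_1^2            & \partial^2 f / \partial x_1 \partial x_2 \\
      \partial^2 f / \partial x_2 \partial x_1 & \partial^2 f / \partial x_2^2
    \end{pmatrix}
    \begin{pmatrix} p_1 \\ p_2 \end{pmatrix}
  = p_1^2 \frac{\partial^2 f}{\partial x_1^2}
    + 2 p_1 p_2 \frac{\partial^2 f}{\partial x_1 \partial x_2}
    + p_2^2 \frac{\partial^2 f}{\partial x_2^2}

which is exactly the second-order term of the verbose expansion, with pᵢ = xᵢ - aᵢ.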

Gradients

∇φ is a vector if φ is a scalar field (a field being "a collection of values with a plus operator and a times operator" [3]). In Cartesian 3 dimensions, it's:

∇φ = grad φ = i ∂φ/∂x + j ∂φ/∂y + k ∂φ/∂z

where φ is a scalar and i, j and k are unit vectors. There is no reason to restrict ourselves to 3 dimensions; it's just the form commonly used in the mathematical sciences.

"The vector ∇ φ is perpendicular to the surface φ = const" (see Boas [2] p292, 293, 296)

Divergence

If ∇ acts on a vector, V:

V = i Vx(x, y, z) + j Vy(x, y, z) + k Vz(x, y, z)

then

∇ . V = div V = ∂Vx/∂x + ∂Vy/∂y + ∂Vz/∂z

Again, this is the special case in 3-D but we need not be restricted to just 3 dimensions.
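For instance (my own worked example), for the radially expanding field V = x i + y j + z k:

\nabla \cdot \mathbf{V} = \frac{\partial x}{\partial x} + \frac{\partial y}{\partial y} + \frac{\partial z}{\partial z} = 1 + 1 + 1 = 3

a constant, positive divergence everywhere, matching the intuition of a field spreading out from every point.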

Laplacian

The Laplacian occurs in a variety of problems in physics (wave equations, diffusion [2] etc). Again, this is the 3-D version but it can be generalized to any number of dimensions:

∇²φ = div grad φ = ∂²φ/∂x² + ∂²φ/∂y² + ∂²φ/∂z²
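Continuing the worked example from the Gradients section, with φ = x² + y² + z²:

\nabla^2 \varphi = \frac{\partial^2 \varphi}{\partial x^2} + \frac{\partial^2 \varphi}{\partial y^2} + \frac{\partial^2 \varphi}{\partial z^2} = 2 + 2 + 2 = 6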

"Now, that is just the mathematical definition, depending on the field where the Laplacian appears it may have different interpretations. The one I like is the relation with the mean curvature of a surface (how much bent the surface is). The mean curvature is the divergence of the unit normal of a surface, if the normal is given by the gradient of a function (that means that your surface is a level curve, an equipotential, an isopotential, etc...) then the curvature is proportional to the Laplacian." (from Quora)

Jacobians

"The Jacobian of x, y with respect to s, t is the determinant" [2]

J =
| ∂x/∂s   ∂x/∂t |
| ∂y/∂s   ∂y/∂t |

The Jacobian matrix is the matrix for which this is the determinant, namely:

Jᵢⱼ = ∂fᵢ / ∂xⱼ

From what I can tell, 'Jacobian' can refer to either the matrix or its determinant depending on the context.

"The Jacobian can be interpreted as the density of the space. In other words, let’s say that we have J = 5 for a given system xi. This means that we have packed 5 units of Cartesian space into a volume in xi space through transformation from Cartesian to the given system." [4]

Hessians

A Hessian is a matrix that looks like:

Hᵢⱼ = ∂²f / ∂xᵢ ∂xⱼ

Confusingly, the operator ∇² is used for both the Laplacian and the Hessian. However, if ∇ is a column matrix, the Hessian is ∇∇ᵀ and the Laplacian is ∇ᵀ∇.

Note that the "trace of the Hessian (the sum of the diagonal elements) is the Laplace operator." (from Quora.com)

Note that the "Hessian is always symmetric" [1].

[1] The Princeton Companion to Mathematics
[2] Mathematical Methods in the Physical Sciences, Boas
[3] Coding the Matrix, Klein
[4] Tensor Analysis for Engineers, Tabatabaian