Thursday, December 22, 2016

Spark and the GC


Spark is memory hungry and that means you need to consider the garbage collector when tuning.

There are a number of GCs to choose from, and they may or may not be appropriate for your job. Some say that G1 may give you (soft) guarantees on maximum pause times but is not well suited to batch jobs. Others point out that CMS may give lower pauses than ParallelGC (the default on 64-bit machines) but that ParallelGC has better overall throughput, and since we're running a batch job rather than a web server, ParallelGC may suit Spark.

More logs than a lumberjack

Whichever GC you use, you'll need to turn logging on and analyse the logs.
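For example (a minimal sketch of my own, not lifted from any particular job), you can switch on GC logging in the executors with the spark.executor.extraJavaOptions property; the same property is where GC selection and sizing flags would go:

import org.apache.spark.SparkConf

// Turn on verbose GC logging in each executor JVM (Java 8 flags).
// GC choice and sizing flags (-XX:+UseParallelGC, -XX:MaxNewSize, -XX:SurvivorRatio, ...)
// go in the same property.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
       "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps")

The output lands in each executor's stdout, which you can usually get at through the Spark UI's executors page.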

You can see objects being promoted from young to old generation (see here and here) but it's inferred rather than stated explicitly. At a young GC, you'll see something like:

[PSYoungGen: BEFORE_YOUNG->AFTER_YOUNG(CAPACITY_YOUNG)] BEFORE_HEAP->AFTER_HEAP(CAPACITY_HEAP)

This describes the memory usage in the young generation before the collection (BEFORE_YOUNG), after it (AFTER_YOUNG), and the capacity of the young generation (CAPACITY_YOUNG).

It also shows the usage of the heap as a whole before (BEFORE_HEAP), after (AFTER_HEAP) and the heap's total capacity.

Note that the amount freed from the young generation does not necessarily equal the amount freed from the heap as a whole, that is:

Δ = (BEFORE_YOUNG - AFTER_YOUNG) - (BEFORE_HEAP - AFTER_HEAP) ≠ 0

This is because not all of the young generation was actually collected. Some of it was promoted to the old generation, and Δ is the amount promoted.
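To make that concrete, here is a made-up log line (the numbers are invented purely for illustration):

[PSYoungGen: 5200000K->150000K(6000000K)] 7000000K->2100000K(8000000K)

The young generation dropped by 5 050 000K but the heap as a whole dropped by only 4 900 000K, so roughly 150 000K of objects survived the collection and were promoted to the old generation.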

In an attempt to avoid premature promotion, and with JVMs that had 8GB of memory, I set -XX:MaxNewSize=6G -XX:SurvivorRatio=6, but instead of taking 5 hours the job took 6. I tried this because there was a lot of young generation GC going on and not a lot of old.

Trying G1GC took just over 6 hours, and a run with -XX:MaxNewSize=1G -XX:SurvivorRatio=3 had to be killed as it looked like it was going to take days. So, I had to look elsewhere.

Disks

You can also check your disk write speeds by using a script taken from here, namely:

# write 2GB of zeros in 8kB blocks, let dd report the throughput, then clean up
dd if=/dev/zero of=/tmp/output.img bs=8k count=256k
rm /tmp/output.img

A decent HDD will typically give you about 300MB/s, a bog-standard SSD (my NUC at home) about 500MB/s.

The best way to avoid Garbage Collection...

... is not to create any garbage. This might not always be possible, but you can minimize it. Kryo shrinks serialized objects amazingly well by replacing the verbose String that represents their fully-qualified class names with just a byte or two. However, it only does this for the classes you explicitly register with it, not for the deep tree of classes they are composed of. For instance, if you register org.apache.spark.mllib.linalg.SparseVector, you'd do well to also register Array[Int] and Array[Double] as these are what actually hog the memory.

It is not sufficient to just register Array[_].

You can put something like sparkConf.set("spark.kryo.registrationRequired", "true") in your code to see what needs to be explicitly added. This will cause runtime exceptions to be thrown until everything that is to be serialized has been explicitly registered.
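Putting that together, a minimal sketch of the registration (my own illustration, not the original job's code) looks something like this:

import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.SparseVector

// Use Kryo, fail fast on anything unregistered, and register the classes
// that actually dominate the shuffle.
val sparkConf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")
  .registerKryoClasses(Array(
    classOf[SparseVector],
    classOf[Array[Int]],
    classOf[Array[Double]]
  ))

With registrationRequired switched on, the exceptions tell you exactly which classes (like the arrays above) still need registering.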

After these changes, the shuffle went from some 2.3TB to 300GB and consequently the time for the job dropped from about 5 hours to less than 45 minutes. With less memory to churn there was less GC and this makes a huge difference to how long your Spark job takes.

Friday, December 16, 2016

More Spark tuning


I've spent some time trying to tune matrix multiplication in Spark. I had 32 executors each with 4 threads and sampled the stacks with jstack over a few minutes with a handy little script:

> cat ~/whatsEveryoneDoing.sh
# Invoke with the application ID of the Spark job as $1.
# For each box in the cluster, find that application's executor JVMs and print the
# first 45 stack lines of every thread whose name starts with "Exec" (the task launch workers).
for BOX in `cat ~/boxes.txt` ; do {
    echo $BOX
    ssh $BOX "for PID in \`ps -ef | grep $1 | grep jdk8 | grep -v bash | awk '{print \$2}'\` ; do {
        sudo -u yarn jstack \$PID | grep -A45 ^\\\"Exec ;
    } done ; echo " 2>/dev/null
} done

where ~/boxes.txt is just a file with the addresses of all the boxes in the cluster and the argument with which you invoke it is the application ID of the job.

I noticed that of my 128 executor threads, in each sample 50 or so were contending for the lock at:

org.apache.spark.storage.memory.MemoryStore.reserveUnrollMemoryForThisTask(MemoryStore.scala:570)

(This is Spark 2.0.2).

So, reducing --executor-cores from 4 to 2 while increasing the number of executors only from 32 to 48 (that is, fewer cores in total) had a big impact: the runtime was reduced from about 7.5 hours to 5 hours.
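The same change expressed as configuration (a sketch of my own; the original was done with spark-submit flags):

import org.apache.spark.SparkConf

// Fewer cores per executor means fewer threads per JVM contending for MemoryStore's lock.
val conf = new SparkConf()
  .set("spark.executor.instances", "48")  // was 32
  .set("spark.executor.cores", "2")       // was 4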

Wednesday, December 14, 2016

Bayesian Fun


The Model

Dr Bristol has tea with statistician Sir Ronald Fisher. She claims she can tell the difference between when the tea is poured first then the milk or milk first then the tea. Sir Ronald doesn't quite believe this and tests her. Sure enough, she gets it right five out of six times. Is this statistically significant?

Sir Ronald "arrived at a method that consists of calculating the total probability of the observed result plus the probability of any more extreme results possible under the null hypothesis (i.e., the probability that she would correctly identify 5 or 6 cups by sheer guessing). This probability is the p-value. If it is less than .05, then Fisher would declare the result significant and reject the null hypothesis of guessing." (from here).

The p-value debate has been rumbling for a while ("for example, a 0.05 p-value does not imply the false positive rate is 5.0%, it can be much higher."). Briefly, if there are 100 000 hypotheses and only 10% are true, then of the 90 000 that are not true, 4 500 will randomly fall within the 5% (p-value = 0.05) range. So, of the 14 500 hypotheses we think are true (the 10 000 genuinely true ones plus the 4 500 false positives), 4 500 are actually false. That means about 31% (4 500 / 14 500) of our knowledge is rubbish, and that's the best case scenario, one in which we correctly identify every one of the true 10%!

The devout Bayesian Prof Dennis Lindley discusses in this article why p-values are inadequate and proposes a Bayesian solution.

The Problem

Lindley notes that when using p-values, the probabilities depend on the design of your experiment and not just on the results. To illustrate:

"If the experiment consisted of 6 pairs of cups being tested and the result was RRRRRW, the relevant probability is .109", that is 6C5 ½6C6 ½6 and here, R means 'right' and W means 'wrong'.

"If the experiment consisted of pairs being tested until the first error, with the same result, the relevant probability is .031", that is ½5.

This "leads to absurdities because it hinges upon the nonexistent ability to define what other unobserved results would count as “more extreme” than the actual observations. That is, if Fisher had set out to serve Dr. Bristol 6 cups (and only 6 cups) and she is correct 5 times, then we get a p-value of .1, which is not statistically significant. According to Fisher, in this case we should not reject the null hypothesis that Dr. Bristol is guessing. But had he set out to keep giving her additional cups until she was correct 5 times, which incidentally required 6 cups, we get a p-value of .03, which is statistically significant. According to Fisher, we should now reject the null hypothesis. Even though the data observed in both cases are exactly the same, we reach different conclusions because our definition of “more extreme” results (that did not occur) changes depending on which sampling plan we use." (Etz et al)

Bayesian Analysis

Bayes defined the posterior probability as:

posterior = K x prior x likelihood

"where K is a number chosen to make the integral of the right-hand side 1." (Lindley)

"Likelihood is a funny concept. It’s not a probability, but it is proportional to a probability ... Since a likelihood isn’t actually a probability it doesn’t obey various rules of probability. For example, likelihood need not sum to 1.

"A critical difference between probability and likelihood is in the interpretation of what is fixed and what can vary. In the case of a conditional probability, P(D|H), the hypothesis is fixed and the data are free to vary. Likelihood, however, is the opposite. The likelihood of a hypothesis, L(H|D), conditions on the data as if they are fixed while allowing the hypotheses to vary.

"Bayes factors are simple extensions of likelihood ratios. A Bayes factor is a weighted average likelihood ratio based on the prior distribution specified for the hypotheses." [from Etz's blog]

The Priors

Lindley considers the cases of wine and tea tasting and people who claim they can tell you about the process by which each was made. He believes that wine experts really know what they're tasting while tea experts do not.

Lindley encodes these beliefs in the following equations:

Belief that the wine taster can correctly identify wines with probability P:

48(1-P)(P-½) for ½ < P < 1

"This expresses the fact that I think that she can discriminate but can make mistakes. The value 48 makes the total probability 1," he says.

Belief that the tea taster can correctly identify the preparation process with probability P:

1.6(1-P) for P ≥ ½

Note that this density integrates to only 0.2 over ½ ≤ P ≤ 1; the remaining 0.8 is the prior probability that the tea taster is simply guessing (P = ½), so the total probability is still 1.
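As a quick numerical sanity check (this snippet is mine, not Lindley's), a crude midpoint Riemann sum confirms that the wine prior integrates to 1 while the tea density only accounts for 0.2:

// Integrate the two prior densities over ½ ≤ P ≤ 1 with a midpoint Riemann sum.
def integrate(f: Double => Double, a: Double, b: Double, n: Int = 100000): Double = {
  val h = (b - a) / n
  (0 until n).map(i => f(a + (i + 0.5) * h)).sum * h
}

integrate(p => 48 * (1 - p) * (p - 0.5), 0.5, 1.0)  // ≈ 1.0, the wine prior
integrate(p => 1.6 * (1 - p), 0.5, 1.0)             // ≈ 0.2, the continuous part of the tea prior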

The Data

Given 6 trials, the probability that the taster correctly identifies all 6 is P⁶. The posterior in this case for tea is:

normalizer × prior × likelihood = K(1-P)P⁶ for ½ < P < 1

The probability that the taster identifies the first 5 correctly but the last incorrectly is P⁵(1-P).

"The prior value of this probability was .8, which drops to .59 when 1 error is made in 6 pairs", that is, P at the maximum posterior at RRRRRW is about 0.63.

"...and to .23 with no errors", that is, P at the maximum posterior for RRRRRR is about 0.86. So, the prior for this is 1.6 x (1 - 0.86) = 0.23

[Incidentally, we know this because the posterior ∝ P⁶(1-P). So, if we differentiate to find the maximum:

d(P⁶(1-P))/dP = 0

or

6P⁵ - 7P⁶ = 0

which means P = 6/7 ≈ 0.86. This is the most probable value of P, the probability that the taster gets a pairing right. Plug it into the equation representing our belief and it becomes 1.6(1 - 6/7) ≈ 0.23, which is the 0.23 Professor Lindley is talking about.]
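As another quick check of my own (not part of Lindley's article), a grid search over the posterior gives the same mode:

// The tea posterior after RRRRRR is proportional to (1-P) * P^6.
val grid = (5000 to 10000).map(_ / 10000.0)           // P from 0.5 to 1.0
val mode = grid.maxBy(p => (1 - p) * math.pow(p, 6))  // ≈ 0.857, i.e. 6/7
val priorAtMode = 1.6 * (1 - mode)                    // ≈ 0.23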

Naturally, the graph for wine tasting is different as the prior is quadratic, not linear, in P. Its posterior peaks are somewhat closer to 1.0 than the tea taster's (as we'd expect, given the prior expresses greater confidence). But it's also worth noting the shape of the graphs: there is greater certainty when we see 6 Rs than with RRRRRW as the curve is sharper.

Conclusions

"It is typically true that the posterior probability of the null hypothesis exceeds the significance level, though there is no logical connection between the two values." (Lindley)

"Lindley’s Bayesian approach depends only on the observed data, so the results are interpretable regardless of whether the sampling plan was rigid or flexible or even known at all. Another key point is that the Bayesian approach is inherently comparative: Hypotheses are tested against one another and never in isolation" (Etz et al).