Monday, June 30, 2014

Quick Wins with the Garbage Collector


The problem

Recently, we upgraded our boxes from Linux 2.6.16-0.87.1-smp running on 16-processor Xeon E5620s to Linux 2.6.32-358.32.3.el6.x86_64 running on 24-processor Xeon E5-2620s (all processor counts include hyper-threading).

Unfortunately, our GC times became horrendous. What exactly this was due to remains a mystery, but we were seeing events like this in the GC logs:

.
.
19948.548 [GC [PSYoungGen: 1386848K -> 5888K (1390080K)] 3481117K->2105461K(4186624K), 0.0306870 secs] [Times: user=0.18, sys=0.00, real =0.03 secs]
20077.869 [GC [PSYoungGen: 1387264K -> 5984K (1389568K)] 3486837K->2210813K(4186112K), 9.4264420 secs] [Times: user=4.05, sys=52.99, real =9.43 secs]
20131.745 [GC [PSYoungGen: 1387360K -> 5792K (1390080K)] 3492189K->2115622K(4186624K), 0.0271300 secs] [Times: user=0.16, sys=0.01, real =0.03 secs]
.
.

Whoa! What happened there? Nearly 10 seconds of GC pause times when all the others are fractions of a second?

Also, look where the time was spent - in kernel space (sys=52.99).

AlwaysPreTouch

Given that it's spending a lot of time in the kernel, it makes sense to look at flags that change how the JVM interacts with it. One such flag is -XX:+AlwaysPreTouch. This touches, and therefore commits, all the heap pages at start-up rather than lazily on first use. That is, there is an upfront cost.
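
For example, a launch line might look something like this (the heap sizes and the main class name here are purely illustrative):

java -Xms1024m -Xmx1024m -XX:+AlwaysPreTouch MyApp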

If you run your JVM with this flag, you'll see the process taking its full amount of memory when you run the top command. That is, without it, you'll see:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                          
 6960 henryp    20   0 3312m  19m  10m S   16  0.3   0:00.97 java      

but with it:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                          
 6923 henryp    20   0 3312m 1.0g  10m S   16 14.0   0:02.40 java                                                                            

That is, the RESident space has filled to its assigned 1024m immediately (VIRTual space does not correspond to physical memory and can be ignored).            

Pre-touching pages

The JVM arg -XX:+AlwaysPreTouch causes this code in mutableSpace.cpp to execute when initializing MutableSpace:

void MutableSpace::pretouch_pages(MemRegion mr) {
  for (volatile char *p = (char*)mr.start(); p < (char*)mr.end(); p += os::vm_page_size()) {
    char t = *p; *p = t;
  }
}

(You can also find something similar in VirtualSpace::expand_by)

I'm not a C++ guru, but this appears to do nothing more than read a byte from each page and write it straight back. Why does it do this? Well, by touching each memory page, the kernel is obliged to back it with real memory. Presumably, the pointer is declared volatile just in case the compiler thinks the read and write are pointless and optimizes them away.

This link talks about how lazily loading the pages can be more expensive when a lot of them end up being written to. With AlwaysPreTouch, this hit was taken when the JVM started up.

The Garbage Collector and Network Performance


Our network woes continue.

Coherence has an MBean (Coherence:type=Node,nodeId=XXX) that measures intra-cluster packet loss using its own proprietary protocol, TCMP. In our old environment, this showed a 99.9+ per cent success rate. In our new environment, it was touching about 50%.

Coherence has a tool that measures packet loss within a network. Employing this tool in our new environment showed no problems. So, it seemed the problem wasn't the network.

The only way I could reproduce the problem using this Datagram Tool was to set the buffers to unrealistically small sizes. But we were using Coherence's defaults. What was the problem?

Well, thinking about it, what could cause the buffers to overflow? One answer is that nobody was emptying them. This could be because of the garbage collector.

"Large inbound buffers can help insulate the Coherence network layer from JVM pauses that are caused by the Java Garbage Collector. While the JVM is paused, Coherence cannot dequeue packets from any inbound socket. If the pause is long enough to cause the packet buffer to overflow, the packet reception is delayed as the originating node must detect the packet loss and retransmit the packet(s)." [1]

Indeed, our garbage collection times are much higher on the new, improved hardware. How odd. (This is the subject of another post).

Rather than measure these packet loss statistics dynamically through JConsole, you can also have Coherence print a report every so often. You can do this via the MBean with the object name Coherence:type=Reporter,nodeId=YYY where YYY is your node of choice. Configuration is fairly self-explanatory and it gives you the chance to plot performance over time.
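
If you would rather pull those figures out programmatically, a plain JMX client will do it. The sketch below is just that, a sketch: the JMX URL and port, the nodeId and the PublisherSuccessRate/ReceiverSuccessRate attribute names are assumptions you should check against your own cluster and Coherence version.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class PacketLossCheck {
  public static void main(String[] args) throws Exception {
    // Hypothetical connector address; point this at a JVM exposing the Coherence MBeans.
    JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9991/jmxrmi");
    try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
      MBeanServerConnection mbs = connector.getMBeanServerConnection();
      ObjectName node = new ObjectName("Coherence:type=Node,nodeId=1"); // example nodeId
      // Attribute names assumed; check the Node MBean in your Coherence version.
      System.out.println("Publisher success rate: " + mbs.getAttribute(node, "PublisherSuccessRate"));
      System.out.println("Receiver success rate:  " + mbs.getAttribute(node, "ReceiverSuccessRate"));
    }
  }
}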

[1] Coherence Documentation.

Sunday, June 15, 2014

Lies, Damned Lies and Performance Statistics


A new way of measuring performance

I've introduced HdrHistogram to my pet project, JStringServer. The advantage of HdrHistogram is that it keeps the results in constant memory with constant access time. The only cost is some approximation of the results, and since I'm only interested in means, standard deviations etc., I can live with that.

HdrHistogram works by keeping an array of "buckets", each of which represents a range of values. Any given value that falls into a particular bucket merely increments that bucket's counter. This is much more memory efficient than, say, Apache's JMeter, which keeps every reading in memory.

The ranges of these buckets grow exponentially (well, there are sub-buckets, but let's keep this simple). So, all the outlying results are collected in just one or two buckets. This is very memory efficient as each bucket is just a single Java long.
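
To make that concrete, here is a minimal sketch of recording latencies with HdrHistogram. The bounds, precision and sample values are illustrative and not JStringServer's actual configuration.

import org.HdrHistogram.Histogram;

public class LatencyRecorder {
  public static void main(String[] args) {
    // Track values from 1 to 3,600,000 (e.g. milliseconds up to an hour)
    // with 3 significant digits of precision.
    Histogram histogram = new Histogram(3600000L, 3);

    // Record some illustrative latencies (in ms).
    long[] samples = {5, 6, 7, 6, 630, 5, 8, 6};
    for (long latencyMs : samples) {
      histogram.recordValue(latencyMs);
    }

    System.out.println("Mean   = " + histogram.getMean());
    System.out.println("StdDev = " + histogram.getStdDeviation());
    System.out.println("99th   = " + histogram.getValueAtPercentile(99.0));
  }
}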

Means, standard deviations and all that

Most performance tools like Apache's JMeter will give you a mean and a standard deviation. You might have even heard that 68% of results are within a standard deviation of the mean, 95% within 2 standard deviations, 99.7% within 3 etc (see here for the rule). But this is only true for a normal distribution (the typical, bell-shaped curve).

The time taken by a given measurement, independent of all the others, will typically conform to a Poisson distribution. (See Mathematical Methods in the Physical Sciences by Mary Boas for a clear derivation of this formula from first principles.)
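
For reference (and assuming nothing beyond the standard definition), a Poisson distribution with mean \lambda assigns probability

P(k) = \frac{\lambda^k e^{-\lambda}}{k!}

to seeing the value k, and this is exactly what dpois evaluates in the R code further down.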

Results

Typically, I see:

Initiated 412759 calls. Calls per second = 10275. number of errors at client side = 0. Average call time = 10ms
ThreadLocalStopWatch [name=read, totalCallsServiced=412659, max time = 630ms, Mean = 6.125236575477573, Min = 0, standard deviation 8.263969133274433, Histogram with stepsize = 1. Approximate values: 791, 1822, 3524, 15650, 143582, 307867, 243841, 81151, 18447, 4716, 1395, 493, 314, 195, 106, 97, 82, 66, 35, 10, 6, 7, 10, 7, 12, 14, 8, 7, 4, 2]
ThreadLocalStopWatch [name=write, totalCallsServiced=412719, max time = 26ms, Mean = 0.08618696982692825, Min = 0, standard deviation 0.327317028847167, Histogram with stepsize = 1. Approximate values: 411831, 33604, 771, 238, 80, 32, 17, 8, 1, 1, 2, 4, 5, 3, 0, 0, 1, 1, 1, 1, 0, 0, 1, 2, 5, 8, 4, 0, 0, 0]
ThreadLocalStopWatch [name=connect, totalCallsServiced=412720, max time = 3005ms, Mean = 4.17902209730568, Min = 0, standard deviation 65.54450133531354, Histogram with stepsize = 1. Approximate values: 408721, 117052, 2056, 852, 282, 114, 57, 30, 15, 12, 10, 10, 15, 10, 3, 5, 4, 3, 2, 1, 3, 4, 2, 0, 1, 1, 0, 1, 1, 0]
ThreadLocalStopWatch [name=total, totalCallsServiced=410682, max time = 66808064ns, Mean = 6295825.973268021, Min = 854104, standard deviation 1089344.7561186429, Histogram with stepsize = 1333333. Approximate values: 145, 837, 1596, 25512, 273990, 93021, 12099, 2283, 520, 273, 126, 77, 80, 55, 11, 9, 5, 9, 10, 11, 15, 5, 3, 2, 3, 3, 0, 0, 4, 1, 2]

For something that approximates to a Poisson distribution, we'd expect the mean and standard deviation to be about the same. Since this is not true for the total time, perhaps this is not a Poisson distribution. The results for reading data do have these two values roughly the same, so let's look at those.

Let's see if the distribution of the values conforms to the Poisson distribution or even the Normal (aka Gaussian). In the R language:

require(graphics)

# Histogram of times. Each interval is 1ms starting from 1ms
xread <- c(791, 1822, 3524, 15650, 143582, 307867, 243841, 81151, 18447, 4716, 1395, 493, 314, 195, 106, 97, 82, 66, 35, 10, 6, 7, 10, 7, 12, 14, 8, 7, 4, 2)
readMean = 6.125236575477573
readSd   = 8.263969133274433
readNum  = 412659

par(mfrow = c(3, 1)) # we're looking at 3 graphs

n      = readNum
x      = xread
mean   = readMean
sd     = readSd

ac <- barplot(x, main="Actual Distribution", axes=TRUE) 
axis(1, at=ac, labels=TRUE)

i <- 1:length(x)

expectedPoisson = n * dpois(i, lambda=mean)
expectedNorm    = n * dnorm(i, mean=mean, sd=sd)

nm <- barplot(expectedNorm, main="Expected Normal Distribution")
axis(1, at=nm, labels=TRUE)

pn <- barplot(expectedPoisson, main="Expected Poisson Distribution")
axis(1, at=pn, labels=TRUE)

chisq.test(x, p=expectedPoisson, rescale.p=TRUE)
chisq.test(x, p=expectedNorm,    rescale.p=TRUE)

The last two lines test whether the actual distribution conforms to the Poisson (given a lambda value) or a Normal (given the mean and the standard deviation). They do this using the chi-squared test (basically, a comparison of the squared differences between what we expect and what we get).
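
For reference, Pearson's chi-squared statistic is

\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}

where O_i are the observed bucket counts and E_i the expected ones; a small p-value then means the observed counts are very unlikely to have come from the expected distribution.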

For these particular results, the p-values from the chi-squared tests tell me they conform to neither a Poisson nor a Normal distribution (in fact, they could barely be further from them). Looking at the graphs, you can see this by eye:



So, what went wrong? Well, the Poisson distribution relies on events being independent of each other, which is clearly not the case when there is contention for resources.

The Central Limit Theorem

(Aside):

The means of samples drawn from a distribution will themselves form a normal distribution given enough runs, even if the original distribution is not itself a normal distribution.

Let's look at plotting the Poisson distribution in R:

mu=1; sigma=10   # mean, standard deviation
n=5000           # Number of iterations
xbar=rep(0,n)    # Holds the results of the iterations

for (i in 1:n) { 

  xbar[i]=mean(
    rpois(5000, mu) # 5000 random draws from a Poisson with mean mu
    ) 
}

par(mfrow = c(2, 1))


# Plot a typical Poisson distribution

hist(rpois(n, mu),prob=TRUE,breaks=15)

# Plot the results of n Poisson distributions

hist(xbar,prob=TRUE) #,xlim=c(70,130),ylim=c(0,0.1))

Gives this result:

A typical Poisson distribution and the mean of a Poisson over many iterations

This is quite nice as it means that if you run performance tests repeatedly, your means will approach a Normal Distribution, giving some predictive power over them.