Sunday, August 15, 2010

Emergent (Mis)behavior

I've been puzzling for a while over why our bond allocation program would suddenly hang when under stress.

Fortunately for us, we have a good test harness and I could attach a profiler and watch and wait for it to happen. Eventually, it did:


Emergent behaviour as all threads spontaneously start contending for a resource

All the threads found themselves in the same logjam - but for no discernible reason. It was like a spontaneous "phantom jam" on a motorway (there is an interesting computer simulation of exactly this in a video here).

It was clear where the threads were choking (mentioned here).

But why did an innocent - if not particularly efficient - piece of code cause the system to grind to a halt?

Well, it's all down to something called emergent behaviour - or Emergent (Mis)behavior as Jeffrey C. Mogul calls it in his excellent paper here. One (of many) definitions is:

Emergent behavior is that which cannot be predicted through analysis at any level simpler than that of the system as a whole.

Mogul's paper does not address the question of whether emergent behaviour is a property of the system or a function of our ignorance. But he does give many examples of where it has occurred in software systems and gives a warning to the architects of SOA systems. Service Oriented Architecture emphasises a separation of concerns:

The fundamental benefit to solving problems this way is that a number of the solution logic units can be designed to solve immediate concerns while still remaining agnostic to the greater problem.
(SOA, Principles of Service Design, p70 - Thomas Erl)

Although it's a good design principle that helps minimize maintenance costs, it can lull the unwary into a false sense of security. To avoid this, one can only watch a system's behaviour in a test environment for a long period of time - time that most projects don't have. Most projects fix these problems once they have gone live.

There is very little literature in this area but Mogul attempts to build a taxonomy of emergent misbehaviour with the hope that one day we'll be better equipped to say whether a particular system may be prone to it.

Emergent behaviour also occurs in non-computer systems. As examples: it was not obvious from the constitution of the Weimar Republic that it could allow the rise of a tyrant like Hitler; the book Freakonomics describes many examples where unexpected behaviour arises in otherwise well-understood systems; and the designers of the Millennium Bridge in London understood footbridges and understood how crowds behave on them, but did not predict how the crowd's response to the bridge would feed back into how the bridge behaved under load.

Regarding this last example, Dr Mogul points out that although this phenomenon was known, no quantitative analysis of it existed. The same problem still afflicts the IT community.

Monday, August 9, 2010

Hash brown-outs

How do you tell if a production server has a memory leak? Simply run:

$JAVA_HOME/bin/jmap -dump:file=/tmp/dump_file PID

and then load the file into your profiler of choice and see where all the memory is being used.

I had to do this last week. As it happened, there wasn't a memory leak but a ConcurrentHashMap that was being populated faster than its elements could be consumed.

The interesting thing was that the memory profile looked all wrong. To see what I mean, we have to look closer at ConcurrentHashMap.

What is a ConcurrentHashMap?

ConcurrentHashMap is a map that more than one thread can read and update without any fear of ConcurrentModificationExceptions being thrown.
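
To make that concrete, here is a minimal sketch (the class and key names are mine, purely for illustration) of one thread writing while another iterates. With a plain HashMap the loop would sooner or later throw a ConcurrentModificationException; with ConcurrentHashMap it never does:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class NoCmeDemo {
    public static void main(String[] args) throws InterruptedException {
        final Map<Integer, String> map = new ConcurrentHashMap<Integer, String>();

        // One thread keeps inserting new entries...
        Thread writer = new Thread(new Runnable() {
            public void run() {
                for (int i = 0; i < 100000; i++) {
                    map.put(i, "value-" + i);
                }
            }
        });
        writer.start();

        // ...while this thread repeatedly iterates over the map.
        // The iterator is weakly consistent: it may or may not see
        // the writer's latest entries, but it never throws.
        for (int i = 0; i < 100; i++) {
            for (Map.Entry<Integer, String> entry : map.entrySet()) {
                entry.getValue(); // just touch each entry
            }
        }
        writer.join();
    }
}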

How does it do this?

ConcurrentHashMap is basically a hash table of hash tables. These inner hash tables are called "segments". Threads that insert elements only lock one of these segments. This is one way that ConcurrentHashMap improves throughput.

Threads that wish to read do not use locking. Instead, you see snippets like this in the code:

transient volatile HashEntry[] table;
...
HashEntry getFirst(int hash) {
    HashEntry[] tab = table;
    return (HashEntry) tab[hash & (tab.length - 1)];
}


Without the method-scoped reference (tab) taking a copy of the object-scoped reference (table) on the first line of the method, the second line would have a race condition. That is: the table could be replaced by a resized one while the array index is being calculated, and the wrong element would be returned.

Of course, this means that between the first and the second line, tab may have become stale. The danger is that this method could return an element that another thread had already removed from the "real" table.

[Aside: copying one reference (table) into another (tab) is, of course, atomic ("Writes to and reads of references are always atomic, regardless of whether they are implemented as 32 or 64 bit values." - Java Language Spec 17.7, 3rd Edition).]

The JavaDocs do warn you about this potential "staleness" when it says:

"Retrievals reflect the results of the most recently completed update operations holding upon their onset."

A bug in ConcurrentHashMap

Back to my memory dump. The problem seen in the output from jmap was that all of the objects we wanted to store in the ConcurrentHashMap were in just one segment rather than being evenly distributed. The code in ConcurrentHashMap that determines the segment into which an object is inserted looks like this:

final Segment segmentFor(int hash) {
    return segments[(hash >>> segmentShift) & segmentMask];
}


Here, hash is the hashCode of our object and segmentShift is 28 for the default constructor of ConcurrentHashMap. Reducing segmentShift significantly means asking for many more segments, which gobbles huge amounts of memory.

Shifting an int to the right by 28 bits means only a very small range of the numbers that can be stored in a 32-bit int are useful - those greater than 2^29 or less than -2^29.
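
To see the skew, here is a toy sketch of the shift arithmetic above (hypothetical hash values; it ignores any supplemental hashing the JDK may apply before segmentFor). Small hashCodes - such as those of small Integers - all land in segment 0:

public class SegmentSkewDemo {
    // The values used by ConcurrentHashMap's default constructor:
    // 16 segments, so segmentShift = 28 and segmentMask = 15.
    private static final int SEGMENT_SHIFT = 28;
    private static final int SEGMENT_MASK = 15;

    public static void main(String[] args) {
        // None of these hashCodes have any bits set above bit 27,
        // so the shift throws everything away and each maps to segment 0.
        int[] hashCodes = { 1, 42, 100000, 10000000, 268435455 };
        for (int h : hashCodes) {
            int segment = (h >>> SEGMENT_SHIFT) & SEGMENT_MASK;
            System.out.println("hashCode " + h + " -> segment " + segment);
        }
    }
}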

This bug is mentioned here but Sun/Oracle appear to say it is fixed in my version of Java when it definitely isn’t (as the source code shows).

This leaves us with the choice of hacking our hashCode() method to achieve these extreme numbers or grinning and bearing it until we upgrade to a version of Java that has fixed this problem.

It's not a major issue so it looks like the latter.
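
For the record, the hack we decided against might have looked something like this: a hypothetical wrapper key (the SpreadKey class below is purely illustrative, not anything we shipped) whose hashCode() folds the low bits of the delegate's hash into the top four bits, so the shift in segmentFor has something left to work with:

// Hypothetical workaround sketch - not production code.
public final class SpreadKey {
    private final Object delegate;

    public SpreadKey(Object delegate) {
        this.delegate = delegate;
    }

    @Override
    public int hashCode() {
        int h = delegate.hashCode();
        // Push low-bit entropy up into bits 24-31 so that
        // (hashCode >>> 28) varies from key to key.
        return h ^ (h << 28) ^ (h << 24);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof SpreadKey
                && delegate.equals(((SpreadKey) o).delegate);
    }
}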

Sunday, August 1, 2010

Go test, young man

Since I've had a week of testing, I thought I'd share some thoughts on an area much neglected by most Agile methodologies: non-functional testing. Everybody knows it's important but most books only make a passing reference to it.

Stress Tests

Since I have my copy of Pressman sitting next to me, I'll use his definition:

Stress tests are designed to confront programs with abnormal situations. In essence, the tester who performs stress testing asks: "How high can we crank this up before it fails?"
(Software Engineering: A Practitioner's Approach - Roger S Pressman)

The trouble is: what defines "fails"? For a web application, it might be when the time it takes to service a request exceeds a predefined limit. But which metric and at which point in time? The average time? And which average - mean or median? Or maybe the 90% line? And what if it just spikes momentarily then settles down? Has the test failed at this point? I don't have the answers.

Sufficient for us was the point at which our system threw its first non-deterministic ConcurrentModificationException - thrown after the code naively tries 5 times to change a data structure that other threads are also changing (this poor design is fixed in the next release).

Look out for deadlocks in your code during these tests. The annoying thing about deadlocks is that they tend to happen when your system is most under stress, so now is your chance to catch them rather than after a production release has gone live and is getting hammered. The JConsole that comes with JDK 6 can do this detection for you automatically.
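
If you'd rather have the same check running unattended inside a test harness, a minimal sketch using the standard ThreadMXBean (JDK 6) might look like this (the polling interval is arbitrary):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockWatcher {
    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        while (true) {
            // findDeadlockedThreads() spots both monitor and
            // java.util.concurrent lock deadlocks on JDK 6.
            long[] ids = threads.findDeadlockedThreads();
            if (ids != null) {
                for (ThreadInfo info : threads.getThreadInfo(ids)) {
                    System.err.println("DEADLOCKED: " + info.getThreadName()
                            + " waiting on " + info.getLockName());
                }
            }
            Thread.sleep(5000); // poll every five seconds during the stress run
        }
    }
}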

Soak Tests

For want of a better reference, Wikipedia defines Soak Tests as "testing a system with a significant load extended over a significant period of time".

For most practical purposes, this boils down to memory leaks. My advice is to attach JConsole and look at the memory usage of the JVM. Then start the stress tests and go home. The following morning, look at the pretty graph JConsole has produced. It should be a nice saw-tooth shape with the average remaining pretty much constant.
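
If you can't leave JConsole attached overnight, a crude alternative is to log the heap yourself and eyeball the figures in the morning - a rough sketch using the standard MemoryMXBean (the one-minute sample rate is arbitrary):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapLogger {
    public static void main(String[] args) throws InterruptedException {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        while (true) {
            MemoryUsage heap = memory.getHeapMemoryUsage();
            // A healthy soak test saw-tooths; a "used" figure that keeps
            // climbing after every collection is the smell of a leak.
            System.out.println(System.currentTimeMillis()
                    + " used=" + heap.getUsed()
                    + " committed=" + heap.getCommitted()
                    + " max=" + heap.getMax());
            Thread.sleep(60000);
        }
    }
}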

Test Harnesses

Apache's JMeter should be good enough for most of your needs. But the question of "who watches the watchmen?" [quis custodiet ipsos custodes?] has never been more important. Two gotchas that got me were:
  1. Too much monitoring in JMeter causes it to use a lot of memory and start performing poorly, thus giving false data.
  2. My test code had a memory leak. I had to connect to the system I was testing using in-house code. This code tries to connect to a messaging system to send audits for each request to the system. If it cannot, it quietly stores the messages in memory for sending later. And so it was the test harness that was slowly running out of memory, not the system under test - which is generally not your first thought. I fixed the problem by getting the source code of the client JAR and commenting out this call before redeploying the test harness.
In each case, I spotted the problem by attaching JConsole to the test harness.

Performance Testing

Put together a test with typical usage metrics. I once worked for a company that had outsourced a lot of its coding. The draft contract stated that access to page X would take no more than 2 seconds. The vendor must have been rubbing its hands with glee, as that should have been easy to achieve with just one person using the system. The draft never specified the number of concurrent users. If the system didn't scale, it was legally no longer the vendor's problem. They had met their contractual obligation.

"Performance tests ... often require both hardware and software instrumentation" says Pressman. On Linux, it should be sufficient to run the top command to get a good idea of CPU usage (vmstat is also useful if you want to monitor IO). The stat that is the most interesting is the load average. If this number is greater than your number of CPUs, work is starting to backlog. Or, to put it another way:

“If there are four CPUs on a machine and the reported one-minute load average is 4.00, the machine has been utilizing its processors perfectly for the last 60 seconds”
Linux Journal Dec 1 2006
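
The same figure is available from inside the JVM on Java 6, should you want your test harness to record it alongside its other metrics - a small sketch (note that getSystemLoadAverage() returns -1 on platforms that don't support it, such as Windows):

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class LoadAverageCheck {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double loadAverage = os.getSystemLoadAverage(); // 1-minute load average
        int cpus = os.getAvailableProcessors();
        System.out.println("load average " + loadAverage + " over " + cpus + " CPUs");
        if (loadAverage > cpus) {
            System.out.println("Work is starting to back up: load exceeds the CPU count");
        }
    }
}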

Profiling

You can use your performance tests to drive your system while you profile the JVM. There are (as ever) some caveats:
  1. There's a testing version of the Heisenberg Uncertainty Principle: the act of observation affects the system. Profiling adds an overhead, so run these tests in addition to, not instead of, your non-profiling performance tests.
  2. Just because your profiler shows that a particular method takes a long time, it doesn't mean it did a lot of work. Message-driven systems typically spend a lot of time waiting for something to happen. This shows up in the profiler as a method that appears to be "hogging" CPU time, which is a misleading interpretation. Another common time "hog" is code waiting on IO - in a server, this is typically an order of magnitude greater than the time spent in your own code. So don't be surprised if you see a lot of time spent here:
java.net.SocketInputStream.socketRead0(FileDescriptor, byte[], int, int, int)
java.net.SocketInputStream.read(byte[], int, int)
java.io.BufferedInputStream.fill()
java.io.BufferedInputStream.read()
java.io.DataInputStream.readInt()

Tomcat Tuning

The hunt for performance improvements in our app continues - this time in the Web Tier.

I tried adding the following JVM args when I ran the JMeter performance tests:

-XX:+AggressiveOpts
Uses performance improvements in the JVM that are still experimental in a given release [1].

-XX:+UseFastAccessorMethods
Inlines getXXX() methods for primitives [2].

-XX:CompileThreshold=<n>
Controls when HotSpot compiles bytecode into native code. High numbers mean faster start-up time; low numbers mean the code is optimized sooner. For a server, a low number may be desirable [3].

-XX:+UseBiasedLocking
Allows threads to have an affinity with a lock they frequently use, making that lock faster to acquire [4].

There are many others but from reading the literature, these looked likely candidates.

Unfortunately I saw no improvements on my machine. This last point is important. Currently, we have a shortage of QA machines so I had to try the tests on my Windows machine. I protested that this was not an accurate way of doing things. For instance, adding the -server JVM arg massively improved performance. I was quite excited until I realized that on multi-core Linux machines such as our production boxes this is the default. On 32-bit Windows machines, it is not [5].

So, I gave up on JVM tuning and looked at Tomcat config. The obvious first choice was to use NIO (non-blocking IO). This is easy. You just make your connector in server.xml look like this:

<Connector port="8080"
           protocol="org.apache.coyote.http11.Http11NioProtocol"
           socket.rxBufSize="100000"
           socket.txBufSize="100000"
           socket.directBuffer="true"
           connectionTimeout="20000"
           redirectPort="8443" />

But although this massively reduced the number of threads from roughly one thread per connection to about half a dozen, it didn't seem to speed things up.

Time to fire up the profiler of choice, YourKit. Under normal load, the threads look like this:
Fair enough. A spot of contention (red) but mostly waiting (yellow) for something to happen (eg, an incoming request). But some waiting was for a message from an internal library that was developed in-house by another team (profilers can't distinguish between legitimate waiting and unnecessary waiting).

So, I looked at the thread the third party library gave us and it looks like this:
Green indicates the thread is doing work - light green when it executes in native code, dark green for all other code. The light green in this picture is just listening on a socket in native code, the dark green is just one thread that happens to be doing a lot of serialization (see the bottom half of the screen).

So, we have a bottleneck that's nothing to do with Tomcat. Tomorrow, I am going to rip out this in-house library as it's caused us nothing but pain and go back to Java RMI. Then, let's see how much we can tune Tomcat.

[1] http://www.oracle.com/technetwork/java/tuning-139912.html#section4.2.7
[2] http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html
[3] http://java.sun.com/docs/books/performance/1st_edition/html/JPAppHotspot.fm.html
[4] http://www.oracle.com/technetwork/java/tuning-139912.html#section4.2.5
[5] http://download-llnw.oracle.com/javase/6/docs/technotes/guides/vm/server-class.html