Monday, August 9, 2010

Hash brown-outs

How do you tell if a production server has a memory leak? Simply run:

$JAVA_HOME/bin/jmap -dump:file=/tmp/dump_file PID

and then load the file into your profiler of choice and see where all the memory is being used.

I had to do this last week. As it happened, there wasn't a memory leak but a ConcurrentHashMap that was being populated faster than its elements could be consumed.

The interesting thing was that the memory profile looked all wrong. To see what I mean, we have to look closer at ConcurrentHashMap.

What is a ConcurrentHashMap?

ConcurrentHashMap is a map that more than one thread can read and update without any fear of ConcurrentModificationExceptions being thrown.

How does it do this?

ConcurrentHashMap is basically a hash table of hash tables. These inner hash tables are called "segments". Threads that insert elements only lock one of these segments. This is one way that ConcurrentHashMap improves throughput.

Threads that wish to read do not use locking. Instead, you see snippets like this in the code:

transient volatile HashEntry[] table;
.
.
.
HashEntry getFirst(int hash) {
    HashEntry[] tab = table;
    return (HashEntry) tab[hash & (tab.length - 1)];
}


Without the method-scoped reference (tab) being pointed at the object-scoped reference (table) on the first line of the method, the second line would have a race condition: the table could be resized while the array index was being calculated, and the wrong element would be returned.

Of course, this means that between the first and the second line, tab may have become stale. The danger is that this method could return an element that another thread had already removed from the "real" table.

[Aside: pointing a reference (tab) to another reference (table) is, of course, atomic ("Writes to and reads of references are always atomic, regardless of whether they are implemented as 32 or 64 bit values." - Java Language Spec 17.7, 3rd Edition).]
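To make the resize race concrete, here is a sketch of my own (not the JDK source) of what the method would look like if it read the field directly both times:

transient volatile HashEntry[] table;

// BROKEN sketch, for illustration only: the volatile field 'table' is read twice.
// Another thread may swap in a resized array between the two reads, so an index
// calculated against the old array's length is applied to the new array.
HashEntry getFirstRacy(int hash) {
    int index = hash & (table.length - 1);   // first read of 'table' (for its length)
    return (HashEntry) table[index];         // second read of 'table' (for the element)
}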

The JavaDocs do warn you about this potential "staleness" when they say:

"Retrievals reflect the results of the most recently completed update operations holding upon their onset."

A bug in ConcurrentHashMap

Back to my memory dump. The problem visible in the output from jmap was that all of the objects we wanted to store in the ConcurrentHashMap were sitting in just one segment rather than being evenly distributed. The code in ConcurrentHashMap that determines which segment an object is inserted into looks like this:

final Segment segmentFor(int hash) {
    return segments[(hash >>> segmentShift) & segmentMask];
}


Here, hash is the hashCode of our object and segmentShift is 28 for the default constructor of ConcurrentHashMap. Reducing segmentShift significantly (by asking for a much higher concurrency level, and therefore many more segments) means gobbling huge amounts of memory.

Shifting an integer 28 bits to the right means only a very small range of the numbers that can be stored in a 32-bit integer are useful - those greater than 2^29 or less than -2^29.
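A little arithmetic shows the effect. This sketch assumes the hash value reaching segmentFor is essentially the object's raw hashCode (the heart of the bug described above); the class and sample values are mine, purely for illustration:

public class SegmentDemo {

    public static void main(String[] args) {
        // Values used by the default ConcurrentHashMap: 16 segments,
        // so segmentShift = 28 and segmentMask = 15.
        int segmentShift = 28;
        int segmentMask = 15;

        int[] typicalHashes = { 42, 12345, 9999999, Integer.valueOf(100).hashCode() };
        for (int hash : typicalHashes) {
            int segment = (hash >>> segmentShift) & segmentMask;
            System.out.println(hash + " -> segment " + segment);
        }
        // Unless a hash has one of its top four bits set, hash >>> 28 is 0,
        // so every entry piles into the same segment.
    }
}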

This bug is mentioned here but Sun/Oracle appear to say it is fixed in my version of Java when it definitely isn’t (as the source code shows).

This leaves us with the choice of hacking our hashCode() method to produce these extreme numbers or grinning and bearing it until we upgrade to a version of Java that has fixed the problem.

It's not a major issue so it looks like the latter.

Sunday, August 1, 2010

Go test, young man

Since I've had a week of testing, I thought I'd share some thoughts on an area much neglected by most Agile methodologies: non-functional testing. Everybody knows it's important but most books only make a passing reference to it.

Stress Tests

Since I have my copy of Pressman sitting next to me, I'll use his definition:

Stress tests are designed to confront programs with abnormal situations. In essence, the tester who performs stress testing asks: "How high can we crank this up before it fails?"
(Software Engineering: A Practitioner's Approach - Roger S Pressman)

The trouble is: what defines "fails"? For a web application, it might be when the time it takes to service a request exceeds a predefined limit. But which metric and at which point in time? The average time? And which average - mean or median? Or maybe the 90% line? And what if it just spikes momentarily then settles down? Has the test failed at this point? I don't have the answers.

Sufficient for us was the first non-deterministic ConcurrentModificationException, thrown after the code naively tries five times to change a data structure that other threads are changing (this poor design is fixed in the next release).

Look out for deadlocks in your code during these tests. The annoying thing about deadlocks is that they tend to happen when your system is most under stress, so now is your chance to catch them rather than after a production release has gone live and is getting hammered. The JConsole that comes with JDK 6 can do this detection for you.
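If nobody will be sitting in front of JConsole while the tests run, the same check can be done programmatically via the JDK 6 ThreadMXBean. A minimal sketch (the class name and logging are mine):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockWatcher {

    // Call this periodically (e.g. from a timer) during a stress run.
    public static void logDeadlocks() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] ids = threads.findDeadlockedThreads();   // null when there is no deadlock
        if (ids == null) {
            return;
        }
        for (ThreadInfo info : threads.getThreadInfo(ids)) {
            System.err.println("DEADLOCKED: " + info.getThreadName()
                    + " waiting on " + info.getLockName()
                    + " held by " + info.getLockOwnerName());
        }
    }
}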

Soak Tests

For want of a better reference, Wikipedia defines Soak Tests as "testing a system with a significant load extended over a significant period of time".

For most practical purposes, this boils down to finding memory leaks. My advice is to attach JConsole and look at the memory usage of the JVM. Then start the stress tests and go home. The following morning, look at the pretty graph JConsole has produced. It should be a nice saw-tooth shape with the average remaining pretty much constant.
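If leaving JConsole attached overnight isn't an option, a cruder alternative (sketched here with a made-up class name) is to log heap usage from inside the JVM under test via MemoryMXBean and eyeball the numbers in the morning:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapLogger implements Runnable {

    // Start this inside the system under test, e.g.:
    //   Thread t = new Thread(new HeapLogger()); t.setDaemon(true); t.start();
    public void run() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        try {
            while (true) {
                MemoryUsage heap = memory.getHeapMemoryUsage();
                // One line a minute is plenty to spot a heap that never comes back down.
                System.out.println(System.currentTimeMillis()
                        + " used=" + heap.getUsed()
                        + " committed=" + heap.getCommitted()
                        + " max=" + heap.getMax());
                Thread.sleep(60 * 1000);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}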

Test Harnesses

Apache's JMeter should be good enough for most of your needs. But the question of "who watches the watchmen?" [quis custodiet ipsos custodes?] has never been more important. Two gotchas that got me were:
  1. too much monitoring in JMeter causes it to use a lot of memory and start performing poorly, thus giving false data.
  2. my test code had a memory leak. I had to connect to the system I was testing using in-house code. This code tries to connect to a messaging system to send audits for each request to the system. If it cannot, it quietly stores the messages in memory for sending later. And so it was the test harness that slowly ran out of memory, not the system under test - which is generally not your first thought. I fixed the problem by getting the source code of the client JAR and commenting out the audit call before redeploying the test harness.
In each case, I spotted the problem by attaching JConsole to the test harness.

Performance Testing

Put together a test with typical usage metrics. I once worked for a company that had outsourced a lot of its coding. The draft contract stated that access to page X would take no more than 2 seconds. The vendor must have been rubbing its hands with glee, as that should have been easy to achieve with just one person using the system. The draft never specified the number of concurrent users. If the system didn't scale, it was legally no longer the vendor's problem: they had met their contractual obligation.

"Performance tests ... often require both hardware and software instrumentation" says Pressman. On Linux, it should be sufficient to run the top command to get a good idea of CPU usage (vmstat is also useful if you want to monitor IO). The stat that is the most interesting is the load average. If this number is greater than your number of CPUs, work is starting to backlog. Or, to put it another way:

“If there are four CPUs on a machine and the reported one-minute load average is 4.00, the machine has been utilizing its processors perfectly for the last 60 seconds”
Linux Journal Dec 1 2006
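If you would rather read the figure from inside the JVM than from top, JDK 6's OperatingSystemMXBean exposes the one-minute load average. A sketch (it returns a negative value on platforms, notably Windows, where the figure isn't available):

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class LoadAverage {

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double loadAverage = os.getSystemLoadAverage();   // last minute; negative if unavailable
        int cpus = os.getAvailableProcessors();
        System.out.println("load average " + loadAverage + " over " + cpus + " CPUs");
        // A load average persistently above the CPU count means work is backing up.
    }
}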

Profiling

You can use your performance tests to drive your system while you profile the JVM. There are (as ever) some caveats:
  1. There's a testing version of the Heisenberg Uncertainty Principle: the act of observation affects the system. Profiling adds an overhead, so run these profiled tests in addition to your non-profiling performance tests.
  2. Just because your profiler shows that a particular method takes a long time, it doesn't mean it did a lot of work. Message-driven systems will typically spend a lot of time waiting for something to happen. This shows up in the profiler as a method that is "hogging" time - a false interpretation. Another common "hog" is code waiting on IO, which in a server is typically an order of magnitude greater than the time spent in your own code. So don't be surprised if you see a lot of time spent here:
java.net.SocketInputStream.socketRead0(FileDescriptor, byte[], int, int, int)
java.net.SocketInputStream.read(byte[], int, int)
java.io.BufferedInputStream.fill()
java.io.BufferedInputStream.read()
java.io.DataInputStream.readInt()

Tomcat Tuning

The hunt for performance improvements in our app continues - this time in the web tier.

I tried adding the following JVM args when I ran the JMeter performance tests:

-XX:+AggressiveOpts
Uses performance improvements in the JVM that are still experimental in a given release [1].

-XX:+UseFastAccessorMethods
Inlines getXXX() methods for primitives [2].

-XX:CompileThreshold=<n>
Controls when HotSpot compiles bytecode into native code. High numbers mean faster start-up, low numbers mean the code is optimized sooner. For a server, a low number may be desirable [3].

-XX:+UseBiasedLocking
Allows threads to have an affinity with a lock they frequently use, making acquiring that lock faster [4].

There are many others but, from reading the literature, these looked like the most likely candidates.

Unfortunately, I saw no improvements on my machine. This last point is important: we currently have a shortage of QA machines, so I had to run the tests on my Windows machine. I protested that this was not an accurate way of doing things. For instance, adding the -server JVM arg massively improved performance. I was quite excited until I realized that on multi-core Linux machines such as our production boxes this is the default; on 32-bit Windows machines, it is not [5].

So, I gave up on JVM tuning and looked at the Tomcat config. The obvious first choice was to use NIO (non-blocking IO). This is easy: you just make your connector in server.xml look like this:

<Connector port="8080"
           protocol="org.apache.coyote.http11.Http11NioProtocol"
           socket.rxBufSize="100000"
           socket.txBufSize="100000"
           socket.directBuffer="true"
           connectionTimeout="20000"
           redirectPort="8443" />

But although this massively reduced the number of threads - from roughly one thread per connection to about half a dozen - it didn't seem to speed things up.

Time to fire up the profiler of choice, YourKit. Under normal load, the threads look like this:
Fair enough. A spot of contention (red) but mostly waiting (yellow) for something to happen (eg, an incoming request). But some of the waiting was on a message from an internal library developed in-house by another team (profilers can't distinguish between legitimate waiting and unnecessary waiting).

So, I looked at the thread this in-house library gave us, and it looks like this:
Green indicates the thread is doing work - light green when it is executing native code, dark green for all other code. The light green in this picture is just listening on a socket in native code; the dark green is just one thread that happens to be doing a lot of serialization (see the bottom half of the screen).

So, we have a bottleneck that has nothing to do with Tomcat. Tomorrow, I am going to rip out this in-house library, as it's caused us nothing but pain, and go back to Java RMI. Then let's see how much we can tune Tomcat.

[1] http://www.oracle.com/technetwork/java/tuning-139912.html#section4.2.7
[2] http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html
[3] http://java.sun.com/docs/books/performance/1st_edition/html/JPAppHotspot.fm.html
[4] http://www.oracle.com/technetwork/java/tuning-139912.html#section4.2.5
[5] http://download-llnw.oracle.com/javase/6/docs/technotes/guides/vm/server-class.html


Saturday, July 31, 2010

There is nothing new under the sun

One of the team (AJ) was sent on a Test Driven Development course this week and came back and gave a show-and-tell at the weekly retrospective. Some interesting points came out of the subsequent discussion.

AJ said that he had been taught that the definition of "legacy code" was code that had no tests (after a quick Google, I assume this comes from Michael Feathers' book Working Effectively with Legacy Code).

But, I pointed out, if we accept this definition then all test code is itself legacy code. Who has tests for tests - Quis custodiet ipsos custodes?

So, what was the best way to avoid the pain often associated with "legacy" code? After all, some of our tests are now over 3 years old. We decided:
  • Make methods so short that unit testing them becomes almost trivial.
  • If a method is truly trivial, there is little point in testing it. (We don't like unit testing everything, since that sounds like we're just ticking check boxes for the Time and Motion man.)
  • Do as little mocking as possible, since writing expectations can lead to hard-to-maintain code. Overriding methods in the (production) class that call other classes or make network calls with methods in a (test) subclass is preferred.
  • Continue using EasyMock on the projects that already use it, but use Mockito for new projects since test code is much more succinct with Mockito. One team member was alarmed that we would then have two mocking frameworks, but was won over by the rest of the team, who had used it and said it was trivial to learn.
  • If we're just quickly trying to get a test to pass rather than thinking about why it's failing, the test serves no purpose.
AJ also mentioned Acceptance Testing. This caused some confusion as to what exactly this was. For those of us who were brought up on Pressman, the definition is:

Conducted by the end user rather than the system developer, an acceptance test can range from an informal "test drive" to a planned and systematically executed series of tests.
(Software Engineering: A Practitioner's Approach, European Adaptation, Fourth Edition, p522 - Roger S Pressman)

More than once, I've heard people use the term "acceptance tests" and assume that their audience knew they meant automated, regression acceptance tests as encouraged by frameworks such as FitNesse. Pressman's definition is more general. So, I started Googling for formal definitions of the terms the team was discussing.


Test Driven Development

The notable characteristics of TDD are:
  • automated testing [1]
  • regression testing
  • black- and white-box testing [2]
  • unit testing [1]
  • test-first programming [1] ("write a little test that doesn't work, and perhaps doesn't even compile to begin with").
[1] Test Driven Development: By Example - Kent Beck
[2] http://www.threeriversinstitute.org/Testing%20Dichotomies%20and%20TDD.htm

The ubiquitous JUnit is an example of a TDD framework.
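For the avoidance of doubt, a JUnit 4 test is as small as this (the Discount class here is invented purely for illustration):

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class DiscountTest {

    @Test
    public void tenPercentOffOneHundredIsNinety() {
        Discount discount = new Discount(10);       // hypothetical class under test
        assertEquals(90, discount.applyTo(100));
    }
}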

Team observations: test-first programming is nice but not to be dogmatically enforced; just because you're using JUnit doesn't mean you're unit testing.


Behaviour Driven Development

Behaviour Driven Development, to give it its British spelling, was a term coined by fellow Brit Dan North. It is a methodology that brings together:
  • automated testing [3]
  • regression testing [3]
  • black-box testing
  • acceptance testing [3]
[3] http://www.slideshare.net/skillsmatter/bdd-introduction

It distinguishes itself from other methodologies in that "acceptance criteria should be executable". To this end, it uses a domain-specific language that looks like natural language, which makes the tests easy for the business to read.

JBehave is an example of a framework that facilitates this.
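As a flavour of how the natural-language criteria map onto code, here is a sketch of a JBehave-style steps class (the Account class and the exact story wiring are invented for illustration):

import org.jbehave.core.annotations.Given;
import org.jbehave.core.annotations.Then;
import org.jbehave.core.annotations.When;
import org.junit.Assert;

// Matched against lines in a .story file such as:
//   Given an account with a balance of 100
//   When 30 is withdrawn
//   Then the balance is 70
public class AccountSteps {

    private Account account;    // hypothetical domain class

    @Given("an account with a balance of $balance")
    public void givenAnAccount(int balance) {
        account = new Account(balance);
    }

    @When("$amount is withdrawn")
    public void whenAnAmountIsWithdrawn(int amount) {
        account.withdraw(amount);
    }

    @Then("the balance is $balance")
    public void thenTheBalanceIs(int balance) {
        Assert.assertEquals(balance, account.getBalance());
    }
}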

Team observation: these tests take a long time to run, which can hamper productivity; testers and business people are supposed to write the tests, but that very, very rarely happens - instead the developer does it.


This is a very incomplete list of terms. Ideally, I want to launch Gimp and put together a matrix of which methodologies include which elements. Although they often come with some very nice frameworks, new methodologies often claim to be paradigm shifts (Dan North calls BDD "a second generation ... methodology" [3]). Really, they are emphasising ideas that have long been established.

Wednesday, July 28, 2010

ActiveMQ Patch Accepted

Yay! The Apache ActiveMQ team have accepted my patch!

This fixes a bug that prevented the consumption of relevant messages from a Topic for which we have a durable subscription when it was full of messages we were not interested in.

The bug manifested itself as the consuming thread simply hanging despite perfectly good messages being added to the JMS Topic.

Fortunately, ActiveMQ comes with MBeans that are easily activated. This allowed us to watch what the little blighter was doing with the much under-utilized JConsole. Consumption stopped when the number of irrelevant messages in the Topic hit 100 - the default size of the buffer. This was the clue that prompted me to go and look in the code at where my consuming thread was blocking (JConsole also shows where all the JVM's threads are).

Incidentally, I'm surprised by how many otherwise really good Java developers don't use JConsole - despite the fact that you get it for free with your JDK. For monitoring an application in production, you can't beat it. Other commonly used libraries often come with MBeans ready to deploy: for instance, did you know that you can change the Log4J logging threshold at runtime? If you want to make your own MBeans and you're using Spring, it couldn't be easier. You just need an org.springframework.jmx.export.MBeanExporter in your context and to annotate your class with @ManagedResource.
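A sketch of what that looks like (the bean itself is invented for illustration, and I'm assuming the exporter in the context is annotation-aware, e.g. Spring's AnnotationMBeanExporter):

import org.springframework.jmx.export.annotation.ManagedAttribute;
import org.springframework.jmx.export.annotation.ManagedOperation;
import org.springframework.jmx.export.annotation.ManagedResource;

@ManagedResource(objectName = "myapp:name=AuditService")
public class AuditService {

    private volatile boolean auditingEnabled = true;

    @ManagedAttribute(description = "Toggle auditing at runtime")
    public boolean isAuditingEnabled() {
        return auditingEnabled;
    }

    @ManagedAttribute
    public void setAuditingEnabled(boolean auditingEnabled) {
        this.auditingEnabled = auditingEnabled;
    }

    @ManagedOperation(description = "Flush any buffered audit messages")
    public void flush() {
        // appears as a button in JConsole's MBeans tab
    }
}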

Saturday, July 10, 2010

Brief comparison of locking between Oracle, Postgres and Sybase

I'm always surprised at how little most server-side developers know about databases, even though all but the most trivial applications use them. And I am not talking about understanding SQL (worthy cause as that is) but understanding what is happening in the DB with regard to locks. You generally don't have any problems at this level during testing, but under a non-trivial load they are a major concern.

So, partly to remind myself, here are some notes on the databases of which I have experience.

Sybase uses (exclusive) write and (shared) read locks extensively. If you update any data, you hold an exclusive lock on that data until the end of the transaction; no other transaction can even read it. "That data" can be anything - it doesn't need to be a row. It could be a table or an index. So, perhaps surprisingly, these two transactions block each other:

TX1:  begin tran
TX2:  begin tran
TX1:  insert into X values ('y')
TX2:  insert into X values ('y')      [blocked by TX1]
TX1:  commit tran                     [TX2 is released]
TX2:  commit tran

They are dealing with different rows but TX1 holds an exclusive lock on the table.

Oracle and Postgres deal with things differently. They use Multiversion Concurrency Control. At the start of a transaction, you get something like a snapshot of the database that you can play with until you "merge" your changes into the "trunk" (note: this is only a loose analogy and there are areas where it breaks down - see later).

So, these two (Oracle) processes don't block each other:

TX1:  set transaction isolation level read committed;
TX2:  set transaction isolation level read committed;
TX1:  insert into X values ('y');
TX2:  insert into X values ('y');     [not blocked]
TX1:  commit;
TX2:  commit;

They only block each other when they really are contending for the same row:

TX1:  set transaction isolation level read committed;
TX2:  set transaction isolation level read committed;
TX1:  update X set Y = 'y' where Z = z;
TX2:  update X set Y = 'y' where Z = z;     [blocked by TX1]
TX1:  commit;                               [TX2 is released]
TX2:  commit;

But (unlike Sybase) TX2 is not blocked in Oracle or Postgres in this scenario:

TX1:  set transaction isolation level read committed;
TX2:  set transaction isolation level read committed;
TX1:  update X set Y = 'y2' where Z = z;
TX2:  select Y from X where Z = z;          [sees the old value for Y]
TX1:  commit;
TX2:  commit;

Sybase would block TX2's select until TX1 committed or rolled back. This leads to an interesting, unintuitive corner case in Sybase that I have seen: since a shared read lock is still a lock, you can have deadlocks between two transactions even if one of them is read-only. For instance, if I issue a select * where ... on a table, I start locking rows in that table. If another transaction updates a row that my select has not yet reached and then tries to update a row my select already has a lock on - deadlock. Sybase chooses one transaction to be the loser; by default, this is the transaction that has done the least work.

So far, I've discussed the READ COMMITTED isolation level - the default on all three databases. The fun starts when you use SERIALIZABLE.

In Sybase:

TX1:  set transaction isolation level 3     -- SERIALIZABLE
TX2:  set transaction isolation level 1     -- READ_COMMITTED
TX1:  begin tran
TX2:  begin tran
TX1:  select count(z) from X
TX2:  insert into X values ('y')            [blocked by TX1]
TX1:  commit;                               [TX2 is released]
TX2:  commit;

Note: if TX1 had been at isolation level 1 (READ_COMMITTED), there would have been no blocking of TX2.

Unsurprisingly, Oracle and Postgres don't block:

TX1:  set transaction isolation level serializable;
TX2:  set transaction isolation level serializable;
TX1:  insert into X values ('y');
TX2:  select count(z) from X;               [sees value 'w', say]
TX1:  insert into X values ('y');
TX2:  select count(z) from X;               [sees value 'w' too]
TX1:  commit;
TX2:  select count(z) from X;               [still sees value 'w'!]
TX2:  commit;

So, although Sybase blocks (which can be somewhat inefficient with anything but the shortest transactions), it is consistent. In Oracle and Postgres, you will continue to see the same value for as long as the transaction lasts, even though the data is stale and a perfectly good change has been successfully committed.

Oracle has to keep hold of this snapshot, potentially forever, to maintain the SERIALIZABLE contract (or at least until you get "ORA-01555: snapshot too old"...).

And whereas Sybase guarantees that your view of the data is always SERIALIZABLE, Oracle and Postgres do not.


TX1:  set transaction isolation level serializable;
TX2:  set transaction isolation level serializable;
TX1:  delete from X where Z = z;
TX2:  update X set Y = 'y2' where Z = z;    [ERROR!]
TX1:  commit;

Oracle gives you:

Error: ORA-08177: can't serialize access for this transaction

and Postgres gives you:

ERROR: could not serialize access due to concurrent update
SQL state: 40001

Interestingly, the criterion for SERIALIZABLE is defined as being met if the final database state is equal to what it would have been had all the transactions run one after the other with no overlap. Note that the DB implementation is not obliged to actually do this; the final state must merely be equivalent to this behaviour. The Sybase programmers chose to guarantee this every time. The Oracle and Postgres developers chose to make a best effort and to throw an exception if the contract could not be honoured.

There will always be subtle differences between implementations. So, Java developers, get to know a little about your database of choice. (Oh, and keeping transactions as short as possible is a good idea in all three databases.)

Aside: it's a pity that the exceptions in the JDBC spec are so non-specific. I get a SQLException if the database is down (from which I can't really recover), and I get a SQLException if my transaction was chosen as a deadlock victim (from which I can recover, namely by trying again). A missed opportunity.
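In practice you end up writing retry logic keyed off the SQLState. A sketch (class and method names are mine; "40001" is the serialization-failure state shown in the Postgres error above - other vendors may report different codes):

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class RetryOnSerializationFailure {

    private static final int MAX_ATTEMPTS = 3;

    // Assumes conn.setAutoCommit(false) has already been called.
    public static void execute(Connection conn, String sql) throws SQLException {
        for (int attempt = 1; ; attempt++) {
            Statement stmt = conn.createStatement();
            try {
                stmt.executeUpdate(sql);
                conn.commit();
                return;                                        // success
            } catch (SQLException e) {
                conn.rollback();
                boolean retryable = "40001".equals(e.getSQLState());
                if (!retryable || attempt == MAX_ATTEMPTS) {
                    throw e;                                   // can't (or won't) recover
                }
            } finally {
                stmt.close();
            }
        }
    }
}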

Sunday, July 4, 2010

Concurrent modification of data structures

Sharing data is something that seems easy but has many subtle problems. When I architected a credit rating system in the past, I chose to make all the domain objects immutable. Although most developers were fine with this approach, it made a number of others antsy, as manipulating the data became a little more unwieldy: by definition, they couldn't change the data (the answer: make a similar object with the new fields) and they felt the constructors took too many arguments, since everything had to be set at object-creation time (the answer: decompose the model).

Anyway, I'm glad we went with that architecture all those years ago because, now that I am on a completely different project, I have to maintain code that did not adopt this pattern. As a result, things are messy.

As an aside, hire good developers. Good architects and good managers are for naught if the guys at the code face are not naturals.

First off, in the current mutable architecture, threads were taking objects from the cache and manipulating them for their own purposes. This changed the underlying data. Oops. So, this was remedied quite early on by making a copy using simple Java serialization.
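For reference, the copy-via-serialization trick looks something like this (a sketch from memory, with my own class name):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public final class DeepCopy {

    // Serialize to a byte array and read it straight back in: simple,
    // but (as noted below) surprisingly expensive for large object graphs.
    @SuppressWarnings("unchecked")
    public static <T extends Serializable> T copy(T original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(bytes);
            out.writeObject(original);
            out.close();

            ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()));
            return (T) in.readObject();
        } catch (Exception e) {
            throw new IllegalStateException("Deep copy failed", e);
        }
    }
}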

But frequently copying large objects using serialization is surprisingly expensive. With 70 threads reading a moderately sized object (hundreds of KB) and 20 updating it, each every 45 seconds, my four-core Linux box is maxing out.

What's more, I'm seeing plenty of ConcurrentModificationExceptions in the logs as the threads performing updates to the data interfere with the threads performing the serialization. This forces a retry, which consumes yet more CPU cycles.

And to top it all, occasionally a reading thread gets stuck in an infinite loop and consumes all the CPU because another thread changed a java.util.HashSet underneath it (read more about this pathological phenomenon here).

Sure, you could say that the JavaDoc clearly tells you not to change non-thread-safe collections while another thread is reading them. But most developers don't reckon on somebody serializing an instance of their class while somebody else updates a collection within it.

And sure, you could say use ConcurrentHashMap and its ilk ("They do not throw ConcurrentModificationException" and "all operations are thread-safe"), but we never chose the implementation of our maps and sets - Hibernate did, as it rehydrated the object.

These are innocent mistakes (albeit with potentially devastating impact). Less forgivable, perhaps, are developers updating java.util.Sets with code like:

aSet.remove(anElement);
aSet.add(anElement);

The equality of our anElement object is based on its primary key, so that's sensible. But of course, in a shared-data world, this introduces a horrible race condition: for a moment, aSet does not contain anElement. Obvious when you think about it, but although this code was well unit tested, the developer never thought about concurrency. It only manifests itself as a bug when the system is under heavy load.

My patch involved using a ReentrantReadWriteLock, which gives me shared (concurrent) read locks and exclusive write locks. Firing up JMeter and running my stress tests showed a moderate degradation in performance (5-10%) but, under heavy load, no errors (compared to 5% errors without this locking, due to race conditions).
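A minimal sketch of the pattern (class and field names are illustrative, not the real code): readers take the shared read lock, so they don't block each other, while a writer swapping an element in and out takes the exclusive write lock, so readers never observe the "momentarily missing element" state described above.

import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class GuardedSet<E> {

    private final Set<E> elements;
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    public GuardedSet(Set<E> elements) {
        this.elements = elements;
    }

    public boolean contains(E element) {
        lock.readLock().lock();              // shared: readers don't block each other
        try {
            return elements.contains(element);
        } finally {
            lock.readLock().unlock();
        }
    }

    public void replace(E element) {
        lock.writeLock().lock();             // exclusive: blocks readers and other writers
        try {
            elements.remove(element);
            elements.add(element);           // equality is by primary key
        } finally {
            lock.writeLock().unlock();
        }
    }
}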

Ideally, I'd like to rewrite this crufty code entirely and wrap changes to the underlying data structure in Command objects that can only be executed in one class. This would give me a centralized place to implement and manage the locking code. Unfortunately, access to the underlying data is currently scattered across the component.

Alas, it's unlikely that I'll get to do this as, for the client, my current solution is "good enough".

As for improving performance, I managed to squeeze out a 30% saving just by copying what the calling code actually needed, rather than copying a larger object graph and then returning a subset of that data. But that's less interesting.