Sunday, July 4, 2010

Concurrent modification of data structures

Sharing data is something that seems easy but has many subtle problems. When I architected a credit rating system in the past, I chose to make all the domain objects immutable. Most developers were fine with this approach, but it made a number of others antsy as manipulating the data became a little more unwieldy. By definition, they couldn't change the data (the answer: build a similar object with the new field values) and they felt the constructors took too many arguments, since everything had to be set at object creation time (the answer: decompose the model).
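
For illustration, a minimal sketch of what I mean by an immutable domain object (the class and fields here are hypothetical, not lifted from the real system): all fields are final, everything is set in the constructor, and "changing" the data means building a new object.

// Hypothetical immutable domain object: all state is final and set at construction.
public final class Rating {

    private final String counterpartyId;
    private final String grade;

    public Rating(String counterpartyId, String grade) {
        this.counterpartyId = counterpartyId;
        this.grade = grade;
    }

    public String getCounterpartyId() { return counterpartyId; }
    public String getGrade() { return grade; }

    // "Changing" the data: build a similar object with the new field value.
    public Rating withGrade(String newGrade) {
        return new Rating(counterpartyId, newGrade);
    }
}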

Anyway, I'm glad we went with that architecture all those years ago, because now, on a completely different project, I have to maintain code that did not adopt this pattern. As a result, things are messy.

As an aside, hire good developers. Good architects and good managers are for naught if the guys at the code face are not naturals.

First off, in the current mutable architecture, threads were taking the objects from the cache and manipulating them for their own purposes. This changed the underlying data. Oops. So, this was remedied quite early on by making a copy using simple Java serialization.
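
The copy is made with plain Java serialization, something along these lines (a simplified sketch; the real code has the error handling and types appropriate to the cache):

import java.io.*;

// Simplified sketch of a defensive copy via serialization (class name is illustrative):
// serialize the object to a byte array and read it straight back, which yields an
// independent copy of the whole object graph.
public class SerializationCopier {

    public static Object deepCopy(Serializable original)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(original);   // walks (and copies) the whole object graph
        out.close();

        ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()));
        return in.readObject();
    }
}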

But frequently copying large objects using serialization is surprisingly expensive. With 70 threads reading a moderately sized object (~100s of KB) and 20 updating it, each every 45s, my four-core Linux box is maxing out.

What's more, I'm seeing plenty of ConcurrentModificationExceptions in the logs as the threads performing updates to the data interfere with the threads performing the serialization. This forces a retry, which consumes yet more CPU cycles.

And to top it all, occasionally a read thread gets stuck in an infinite loop and consumes an entire CPU because another thread changed a java.util.HashSet it was traversing (read more about this pathological phenomenon here).

Sure, you could say the JavaDoc clearly states: don't change non-thread-safe collections while another thread is reading them. But most developers don't reckon on somebody serializing an object of their class while somebody else updates a collection within it.
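
To reproduce this class of problem in isolation: one thread serializing a plain HashSet while another updates it. This is a contrived sketch (the exact failure depends on timing, and the infinite-loop variant on the internals of the older hash implementations), but run it a few times and you will see ConcurrentModificationExceptions, and occasionally worse.

import java.io.*;
import java.util.*;

public class SerializeWhileMutating {

    public static void main(String[] args) throws Exception {
        final Set<Integer> set = new HashSet<Integer>();
        for (int i = 0; i < 10000; i++) set.add(i);

        // Writer thread: continuously mutates the set, as an update thread would.
        Thread writer = new Thread(new Runnable() {
            public void run() {
                int i = 10000;
                while (!Thread.currentThread().isInterrupted()) {
                    set.add(i);
                    set.remove(i - 10000);
                    i++;
                }
            }
        });
        writer.setDaemon(true);
        writer.start();

        // Reader: serializing the set iterates over it, just as our cache copy does.
        for (int run = 0; run < 100; run++) {
            try {
                ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream());
                out.writeObject(set);
                out.close();
            } catch (ConcurrentModificationException e) {
                System.out.println("run " + run + ": " + e);
            }
        }
        writer.interrupt();
    }
}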

And sure, you could say use ConcurrentHashMap and its ilk ("They do not throw ConcurrentModificationException" and "all operations are thread-safe") but we never chose the implementation of our maps and sets - Hibernate did as it rehydrated the object.

These are innocent mistakes (albeit with potentially devastating impacts). Less forgivable perhaps are developers updating java.util.Sets with code like:

aSet.remove(anElement);
aSet.add(anElement);

The equality of our anElement object is based on its primary key, so that's sensible. But of course, in a shared-data world, this introduces a horrible race condition: for a moment, aSet does not contain anElement. It's obvious when you think about it, but although this code was well unit tested, the developer never thought about concurrency, and it only manifests itself as a bug when the system is under heavy load.

My patch involved using a ReentrantReadWriteLock, which gives me shared read locks (so readers don't block each other) and exclusive write locks. Firing up JMeter and running my stress tests showed a moderate degradation in performance (5-10%) but, under heavy load, 0% errors (compared to 5% errors without this locking, due to the race conditions).
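
The shape of the patch, in outline (the class and method names here are hypothetical; the real code guards the serialization copy and the mutators of the shared domain object):

import java.io.Serializable;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical wrapper around one shared, mutable collection inside a domain object.
public class GuardedDomainObject {

    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final Set<Serializable> elements;   // e.g. the Hibernate-populated collection

    public GuardedDomainObject(Set<Serializable> elements) {
        this.elements = elements;
    }

    // Readers share the lock: they run concurrently with each other and only
    // block while a writer holds the write lock.
    public Set<Serializable> copyForCaller() {
        lock.readLock().lock();
        try {
            return new HashSet<Serializable>(elements);   // or the serialization copy
        } finally {
            lock.readLock().unlock();
        }
    }

    // Writers are exclusive, so the remove/add pair can no longer be observed half-done.
    public void replace(Serializable anElement) {
        lock.writeLock().lock();
        try {
            elements.remove(anElement);
            elements.add(anElement);
        } finally {
            lock.writeLock().unlock();
        }
    }
}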

Ideally, I'd like to re-write this crufty code entirely and wrap changes to the underlying data structure in Command objects that can only be executed in one class. This would give me a centralized place to implement and manage the locking code. Unfortunately, access to the underlying data is currently scattered across this component.
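
Roughly what I have in mind (a sketch only, with hypothetical names): every mutation becomes a Command, and a single executor owns the lock, so the locking lives in exactly one place.

import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// A mutation of the shared data, expressed as an object.
interface Command {
    void execute(SharedData data);
}

// Stand-in for the real domain structure.
class SharedData {
    final Set<Object> elements = new HashSet<Object>();
}

// The only class allowed to run Commands, and therefore the only place locking lives.
class CommandExecutor {

    private final Lock lock = new ReentrantLock();
    private final SharedData data;

    CommandExecutor(SharedData data) {
        this.data = data;
    }

    void execute(Command command) {
        lock.lock();
        try {
            command.execute(data);
        } finally {
            lock.unlock();
        }
    }
}

A caller would then write something like executor.execute(new ReplaceElementCommand(anElement)) instead of touching the collections directly.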

Alas, it's unlikely that I'll get to do this as for the client my current solution is "good enough".

As for improving the performance, I managed to squeeze out a 30% saving by copying only what was needed by the calling code, rather than copying a larger object graph and then returning a subset of that data (roughly the change sketched below). But that's less interesting.
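
The shape of that change, with hypothetical names:

// Before: deep-copy the whole domain object, then hand back one collection from the copy.
// Set<Element> result = (Set<Element>) deepCopy(domainObject).getElements();

// After: serialization-copy only the collection the caller actually asked for.
Set<Element> result = (Set<Element>) deepCopy(domainObject.getElements());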

1 comment:

  1. ConcurrentModificationException happening in a production environment is really a bug. Though you manually used ReentrantLock to solve it, you should still establish a solid coding standard for your developers. Every serious Java developer should be expected to keep concurrency in mind; it's really not solely the responsibility of the architects.
