Thursday, July 11, 2013

The Thrashing of the Garbage Collector

With memory running low, a Java process may grind almost to a halt. This can be due to thrashing in the garbage collector.

A handy utility

Before our JVMs died with OutOfMemoryErrors, I was looking at the GC activity. I used jstat to take a reading every 10s (the 10000 (ms) argument below) and got:

:~> jstat -gc 24519 10000 
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       PC     PU    YGC     YGCT    FGC    FGCT     GCT 
7104.0 7104.0  0.0    0.0   131392.0 131392.0 1165120.0  1165119.9  35712.0 35327.1  12978  170.599 9002  30293.627 30464.226 
7104.0 7104.0  0.0    0.0   131392.0 131392.0 1165120.0  1165119.9  35712.0 35330.2  12978  170.599 9004  30300.773 30471.371 
7104.0 7104.0  0.0    0.0   131392.0 131392.0 1165120.0  1165120.0  35712.0 35329.4  12978  170.599 9007  30312.233 30482.832 
.
.

[24519 was the PID of the process]

FGCT (Full Garbage Collection Time [1]) was the interesting metric. At this point of time, full garbage collection was taking about 10s every 10s or so and the total garbage collection time (GCT) was almost entirely due to it (not the young generation).

[Compare this to a more healthy JVM that had been running for a day:

~> jstat -gc 22226 10000 
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       PC     PU    YGC     YGCT    FGC    FGCT     GCT 
1216.0 1216.0 990.9   0.0   143168.0 27467.5  1165120.0  1088493.7  54016.0 35576.7   9227   89.841  10     15.148  104.989 
1216.0 1216.0  0.0   992.0  143168.0 46244.1  1165120.0  1088829.1  54016.0 35576.7   9228   89.850  10     15.148  104.997 
1216.0 1152.0 974.8   0.0   143232.0 76039.4  1165120.0  1089091.5  54016.0 35576.7   9229   89.856  10     15.148  105.004 

where only 15 seconds out of the last 24 hours has been due to a full garbage collection.]

Full garbage collection is a stop-the-world event [2, 3, 4] so this explains why the application's responsiveness becomes so poor. It may be worth keeping an eye on this figure.

Parallel Scavenge GC

The default garbage collector on our box was the Parallel Scavenger. I've talked about the Young Generations behaviour with this GC here. But the Old Generation GC has a different strategy. It's called PS MarkSweep. This uses the well known mark-and-sweep algorithm but also a compaction algorithm to stop the memory space looking like Swiss cheese [5]. This can be expensive.

What actually triggers an OOME?

What triggers an OOME is not necessarily there not being enough memory to instantiate an object. Amongst other reasons are "98%+ time is spent in GC" and "less than 2% heap is freed in a collection" [2]. This is not very deterministic.


[1] http://docs.oracle.com/javase/1.5.0/docs/tooldocs/share/jstat.html
[2] QCon lecture
[3] http://www.fasterj.com/articles/oraclecollectors1.shtml
[4] http://apmblog.compuware.com/2011/03/10/major-gcs-separating-myth-from-reality/
[5] Dr Richard Warburton's blog 

Sunday, July 7, 2013

AVL Trees

An interesting problem was presented to us of determining the 90th percentile of a moving window of stress test timings. It must do it efficiently and without running out of memory (thousands of samples per second over the course of hours or days).

My solution was to use a tree-like structure that was the underlying container for an ordered bag. A subset of all data can live in memory - say 100 data points. We know the 90th percentile will start at the node where 90 smaller points are to the left and the remainder are to the right.

(We maintain a sliding window by removing the oldest point before we add the next.)

The tree structure I chose was an AVL tree. "An AVL tree is a BST [binary search tree] where the height of every node and that of its sibling differ by at most 1." [1]

First a quick overview of Binary Search Trees.

2-3 Search Tree

"A 2-3 search tree is a tree that is either empty or

  • A 2-node, with one key (and associated value) and two links, a left link to a 2-3 search tree with smaller keys, and a right link to a 2-3 search tree with larger keys.
  • A 3-node, with two keys (and associated values) and three links,  a 2-3 search tree with smaller keys, a middle link to a 2-3 search tree with keys between the node's keys, and a right link to a 2-3 search tree with larger keys." [1]
(See the diagram below).


Red-Black Tree

Perhaps the most famous Binary Search Tree is the Red-Black tree. This is the underlying implementation of Java's TreeMap.

From Sedgewick's Algorithms:

  • Red links lean left
  • No node has two red links connected to it.
  • The tree has perfect black balance: every path from the root to a null link has the same number of black links.
"There is a 1-1 correspondence between red-black BSTs defined in this way and 2-3 trees."

Isomorphic mapping between a Red-Black tree and a 2-3 tree.

AVL Trees

The first configurations to consider are left-left:

Added 50, 25 then 10
25 swaps places with its parent who in turn becomes its right child.

left-left balanced
and right-right:

Added 50, 75 then 90
75 swaps places with its parent who then becomes its left child

Right-right balanced
This considers the case of a left-leaning and a right-leaning configuration. But what if we have a left-right

Added 50, 25 then 40

 and a right-left addition of numbers?

Added 50, 75, then 60
Well, for left-right, 40 comes up and swaps places with its parent:

Partial rebalancing of left-right

and for right-left, 60 comes up and swaps places with its parent:

Partial rebalancing of right-left

After that, they are now back to a left-left and a right-right configuration so we continue as described above.

(Graphics generated by QMatica's excellent interactive AVL-tree demo).

[1] Sedgewick's Algorithms.

Saturday, July 6, 2013

Grid Computing vs. Distributed Cache Locking

If you want to change a piece of data in a distributed cache you could lock the data, change it then release the lock. But there is a more efficient way.

The idea is to leverage grid computing to execute a unit of work on the cache that does not require any locking. The advantage to this is that there are fewer network calls (one call to execute the work rather than the three to lock, mutate, unlock).

This architectural pattern works for two main reasons:

  1. Key affinity - requests for a given piece of data get routed to the same server in the cluster.
  2. One thread per key - although the host node may have a thread pool, only one of its threads can execute work against a particular key at any one time.
In Oracle's Coherence, this is achieved by implementing the EntryProcessor interface. "Within the process method, you don't need to worry about concurrency - Coherence guarantees that the individual entry processors against the same entry will execute atomically and in the order of arrival, which greatly simplifies the processing logic. It also guarantees that the processor will be executed in the case of server failure, by failing it over to the node that becomes the new owner of the entry it needs to process." - Oracle Coherence 3.5.



Common NIO mistakes

Just for grins, I've been writing my own NIO server. Ostensibly, it's to provide a server that does not need to create lots of String objects (whose constant creation and destruction can hammer performance). Secondarily, I'm learning more about the minutiea of NIO.

These are the silliest mistakes I've made so far:


Buffer Reading and Writing

Most Java developers never have to worry about this as it's all hidden by your container of choice. But if you roll your own, remember the order for reading is read-flip-get. That is:

ReadableByteChannel.read(ByteBuffer)
Buffer.flip()
ByteBuffer.get(...)

For writing, it's put-flip-write, that is:

ByteBuffer.put(...)
Buffer.flip()
WritableByteChannel.write(ByteBuffer)


ServerSocketChannel Backlogs

Don't forget to set this in the bind method or you may be left wondering why your clients start throwing exceptions under a modest load. If too many connections swamp the serber, client's will see exceptions that say something like "connection reset by peer".


Call register() and select() in the Same Thread

It's not obvious, but you must call AbstractSelectableChannel.register and Selector.select(...) in the same thread [1].

[1] Stack Overflow

Anatomy of a Memory Leak

Asking around other developers, it seems that the average analysis for a memory leak can easily take 2 to 3 weeks. When you're working in a 40 node cluster, things can be even more complicated.

This week, I found a memory leak that is only triggered under certain circumstances. What they are is not terribly interesting for somebody outside our project. But the means with which I found it may be.

Basically, the collection of objects to be reaped by our application was empty. But by dumping the JVMs heap (with jmap -dump:live,format=b,file=XXX PID) and analysing the heap with YourKit, I saw that not only was the collection empty, it had never had anything put into it.

How I did this was by looking at HashMap's modCount field. According to the source code, this is the "number of times this HashMap has been structurally modified" (note: replacing a key/value mapping does not count as a modification).

This field is normally used to ensure that iterators are fail-fast. That is, when a call is made to the iterator() method of this data structure, the newly created iterator makes a note of the modCount. Upon calls made to the iterator object, the modCount is checked and if it's not what was expected, an exception is thrown.

How does this help us in interpreting the entrails of a heap dump? Well, if the modCount is 0, chances are (2^31) that nothing ever went into it. If nothing went into it, that path through your code wasn't executed. And that was enough to tell me what the bug was.

HashMap is not the only data structure to use the idea of a modCount. Even the segments of ConcurrentHashMap use this idea. It's well worth checking when you're trying to work out what went wrong.