Agile Java Man: 2012

Saturday, December 22, 2012

...Except Some Are More Equal Than Others

(Or NUMA, NUMA yay!)

Not all memory is equal. Some is faster to access that others. And how you access it is also important.

If you read through arrays of different lengths, you will probably get a performance profile that looks something like this:

Figure 1. Number of Java longs read per nanosecond vs. length of array (bytes)

Performance drops off around the 6MiB mark as that's the size of the L2 cache on my MacBook Pro (Intel Core 2 Duo). I know it's 6MiB by running:

phillip:~ henryp$ sysctl -a | grep cache
.
.

hw.cachelinesize = 64
hw.l1icachesize = 32768
hw.l1dcachesize = 32768
hw.l2cachesize = 6291456
.
.

It looks like what we're seeing is that as the L2 cache fills up with the array, performance drops.

A similar thing happens with the L1 cache which is 32KiB on this machine. (Note, it's the l1dcachesize that represents the memory cache. "L1d is the level 1 data cache, L1i the level 1 instruction cache" [1]).

Running the same code that generated the data in Figure 1 but with a much smaller range of array sizes, we get something like:

Figure 2. Number of Java longs read per nanosecond vs. length of array (bytes)

Here we see performance dropping off when arrays are about 32k. So much we'd expect. But look at the left hand side of the scatter graph. It takes a while for performance to reach the optimal point. This is nothing to do with something like the HotSpot compiler kicking in as all results in this post were from the second run. (The first run's results were discarded as the performance at start up was orders of magnitude slower than anything that followed).

It appears that the reason it's being slow initially is that an optimization has yet to kick in. That is, "Sequential locality, [which is] a special case of spatial locality, occurs when data elements are arranged and accessed linearly, e.g., traversing the elements in a one-dimensional array." [3]

The Intel docs say:

"The Intel® Core™ i7 processor has a 4 component hardware prefetcher very similar to
that of the Core™ processors. Two components associated with the L2 CACHE and two
components associated with the L1 data cache. The 2 components of L2 CACHE
hardware prefetcher are similar to those in the Pentium™ 4 and Core™ processors. There
is a “streaming” component that looks for multiple accesses in a local address window as
a trigger and an “adjacency” component that causes 2 lines to be fetched instead of one
with each triggering of the “streaming” component. The L1 data cache prefetcher is
similar to the L1 data cache prefetcher familiar from the Core™ processors. It has another
“streaming” component (which was usually disabled in the bios’ for the Core™
processors) and a “stride” or “IP” component that detected constant stride accesses at
individual instruction pointers. " [2]

It seems reasonable that this prefetching is responsible for the average time to read data from memory initially being low for smaller arrays. That is, memory needs to be accesses in a linear manner for some time before the CPU can assume that it will continue to be accessed in a similar manner. But more evidence will have to wait for my new (Linux) machine to arrive next week where I can there execute the commands to demonstrate what is happening to the machine's cache [4].

[1] What Every Programmer Should Know About Memory, Ulrich Drepper

[2] "Performance Analysis Guide for Intel® Core™ i7 Processor and Intel® Xeon™ 5500 processors"

[3] Wikipedia.
[4] Martin Thompson's blog.

Sunday, December 9, 2012

Queueing Theory and Practice

Being a Brit, I love queueing. However, I hate it when my software does it.

This week, I've been wondering why we're seeing a lot of contention in our application. It comes mainly via JAXP code as it makes calls to the class loader (see YourKit screen grab below).

Quite why the Sun XML classes keep calling loadClass on the Tomcat class loader for the same class is currently beyond me.

[Aside: set the jaxp.debug system variable to see greater visibility on what the JAXP code is doing. I saw this defined in javax.xml.xpath.XPathFactoryFinder. See if your Java implementation also uses this to set a debugging field.]

Anyway, when I thought I had identified the problem, it occurred to me that this contention was only such a major problem when I started profiling. Like quantum mechanics, I was affecting the results by taking measurements.

It seemed that the slowdown in performance caused by attaching monitors to the JVM was enough to push it over the edge - thread contention went crazy. A quick Google gave me the equations to describe a M/D/1 queue (that is, a queue where arrival is randoM, servicing the request is Deterministic and there is only 1 servicer. This seems to model our situation well in that any thread can make a call to loadClass but they should all be serviced in roughly the same time).

I used Mathematica to plot these queue equations. Here is the expected average wait time, g(p, u):

g[p_, u_] := p/(2 u (1 - p))
g::usage = "Expected average waiting time"

Manipulate[
Plot[{g[p, 1], g[p, u], g[p, 10]}, {p, 0, 1.0},
ImageSize -> {600, 320},
ImagePadding -> {{25, 10}, {25, 10}},
Frame -> True,
AxesLabel -> {"arrival/service rate", "expected average waiting time"}],
{{u, 2, "Service Rate"}, 0.1, 100, Appearance -> "Labeled"},
FrameLabel ->
Style["M/D/1 case (random Arrival, Deterministic service, and one service \
channel)", Medium]
]

where p is

rate work arrives / rate work is serviced

and u is the rate work is serviced.

The graph looks like this (where u is 1, 10 and floating):

The important thing here is that it is not a linear graph. With a service rate of 2 for our variable graph (the purple line), we see that the number in the queue is about 1.0 when the ratio of work arriving to being processed is 80% but if it goes to 90%, the expected queue size goes to about 2.5.

At this usage profile, a 10% change means the average wait time more than doubles.

This could explain why the system became so badly behaved when experiencing only a moderate increase in load.

Saturday, December 8, 2012

The thundering herd

Last week, I wanted a blocking queue that had all the characteristics of a priority queue so I wrote my own. You can find it in the open source Baron Greenback library here. It decorates a PriorityQueue rather than extending it as I didn't want to violate the contract that PriorityQueue provides regarding being unbounded.

My first cut of the code used Conditions hanging off a ReentrantLock and called signalAll. This can be optimised by instead calling signal instead.

From Java Concurrency in Practise:

Using notifyAll when only one thread can make progress is inefficient - sometimes a little, sometimes grossly so. If ten threads are waiting on a condition queue, calling notifyAll causes each of them to wake up and contend for the lock; then most or all of them will go right back to sleep. This means a lot of context switches and a lot of contended lock acquisitions for each event that enables (maybe) a single thread to make progress. (p303)

The Condition class "factors out the Object monitor methods (wait, notify and notifyAll) into distinct object" and so are analogous to the notify and notifyAll methods.

While Goetz warns that conditional notification "is difficult to get right" (you may wake up a thread that sees it doesn't fit the condition and then goes back to being suspended without waking anybody else up) as a rule one should use notify/signal rather than notifyAll/signalAll on an object that is acting as a semaphore.

Addendum: there is an excellent explanation at StackOverflow here that demonstrates getting the choice between notify/notifyAll wrong can be very dangerous indeed. It uses the example of many consumers and producers popping and pushing a stack. The takeaway point is that notifyAll wakes all threads and all of them will contend for the lock and then check the condition sequentially. Whereas a call to notify wakes only one thread and that thread might be a producer who cannot push any more because the data structure is full and therefore waits - ergo, deadlock.

Wednesday, December 5, 2012

Persistent but not Persistent

When is a persistent object not necessarily persistent?

You might think that persistent means that the object can be stored in some durable data store. But there is another meaning [1].

This other meaning describes an optimal way of dealing with data structures. Instead of mutating these data structures, they are decorated with an object that represents the changed structure.

This is easier for some structures than others. For instance, in a linked list each element only knows about the element in front of it. So, if you wanted to add a new chain to the end of the list, you'd just create a new object that represented that chain and whose head pointed at the tail of the original list.

The important thing to remember is that the original list has not changed. So, there is no need to worry about mutexes or race conditions.

There are downsides. If you want to insert in the middle of a single linked list, you'd have to copy the elements in the old list that followed your insertion point into your new chain.

Also, this technique does not work with some data structures, eg double linked lists.

That said, the Java collection classes employ this technique to good effect. If you look at TreeMap.subSet(...) for instance you'll see a new set is returned that contains a subset of the original by wrapping it.

Note: not only do changes in one set affect the other, the original data structure stays in memory even if all other references to it disappear. We found this the hard way when we thought we'd discard elements of a large list but found we were still consuming large amounts of memory.

[1] Persistent Data Structures.

Saturday, November 24, 2012

Algorithm of the Week

In a week of cool algorithms, this one in java.util.PriorityQueue was the most interesting.

If you step through the code adding and remove-ing elements, you'll see the array represented by PriorityQueue.queue changing but it doesn't represent the order that elements come out of the queue. The documentation says the "Priority queue [is] represented as a balanced binary heap".

What is a binary heap?

"Definition: A binary heap is a collection of keys arranged in a complete heap-ordered binary tree, represented in level order in an array (not using the first entry).

"In a binary heap, the keys are stored in an array such that each key is guaranteed to be larger than (or equal to) two additional keys, and so forth.

"Definition: A binary tree is heap-ordered if the key in each node is larger than or equal to the keys in that node's two children (if any). Equivalently, the key in each node of a heap-ordered binary tree is smaller than or equal to the key in that node's parent (if any). Moving up from any node, we get a nondecreasing sequence of keys; moving down from any node, we get a nonincreasing sequence of keys. In particular... the largest key in a heap-ordered binary tree is found at the root." - Algorithms, Sedgewick & Wayne.

So, a tree that looks like this:

[Graphic taken from the Wikipedia entry]

maps isomorphically to an array, in this case 100, 19, 36, 17, 3, 25, 1, 2, 7. The rule is (from comments in PriorityQueue):

the two children of queue[n] are queue[2*n+1] and queue[2*(n+1)]

And "if multiple elements are tied for least value, the head is one of those elements -- ties are broken arbitrarily."

One last thing, the order of child nodes is irrelevant.

Sunday, November 18, 2012

Intrinsic Methods

If you copy and paste some of the JVM code into your own class you might be surprised that it doesn't run as fast as when it runs in its original library.

Hotspot does this by swapping Java code out and replacing it with hand-written code that is optimized to a particular architecture. Michael Barker shows on his blog how java.lang.Integer.bitCount() is replaced with an Intel ("Nehalem and later") instruction popcnt (this appears to be defined in jdk7/hotspot/src/cpu/x86/vm/x86_*.ad)

This surprising phenomena is due to some methods being intrinsic methods. Oracle's Performance Tuning wiki page is somewhat quiet on the subject but it does point in the direction of library_call.cpp and vmSymbols.hpp.

The classes that are mentioned as having intrinsics can be found with:

$ grep do_intrinsic vmSymbols.hpp | perl -pe s/\#.*//g | perl -pe s/\\/\\/.*//g | awk '{ print $2 }' | sort | uniq | perl -pe s/_/./g | perl -pe s/,//g

java.lang.Boolean
java.lang.Byte
java.lang.Character
java.lang.Class
java.lang.Double
java.lang.Float
java.lang.Integer
java.lang.invoke.InvokeDynamic
java.lang.invoke.MethodHandle
java.lang.invoke.MethodHandleNatives
java.lang.Long
java.lang.Math
java.lang.Object
java.lang.reflect.Array
java.lang.reflect.Method
java.lang.ref.Reference
java.lang.Short
java.lang.String
java.lang.StringBuffer
java.lang.StringBuilder
java.lang.System
java.lang.Thread
java.lang.Throwable
java.nio.Buffer
java.util.Arrays
sun.misc.AtomicLongCSImpl
sun.misc.Unsafe
sun.reflect.Reflection

[Aside: not all of them actually lead to intrinsics. I was looking at the intrinsic for java.nio.Buffer.checkIndex(...) but saw this in library_call.cpp:

case vmIntrinsics::_checkIndex:

// We do not intrinsify this. The optimizer does fine with it.

return NULL;

]

But the java.lang.Math class does seem well represented in x86_64.ad where, for instance, the sin function is mapped to an Intel opcode rather than library calls.

Direct Memory Access

We're currently profiling a Java application that features Lucene. There's some debate as to what is going on. Is our app performing badly because it is IO-bound? This would make sense as we are writing a lot of indexes to disk. So, why is the CPU usage so high?

Would a lot of IO trash the CPU? Kernel writer, Robert Love, suggests not:

"Given that the processors can be orders of magnitude faster than the hardware they talk to, it is not ideal for the kernel to issue a request and wait for a response from the significantly slower hardware. Instead, because the hardware is comparatively slow to respond, the kernel must be free to go and handle other work, dealing with the hardware only after that hardware has actually completed its work." [1]

In most (non-embedded) architectures, it appears that the CPU has very little to do with the heavy-lifting of data. What goes on is this (with a lot explanation stolen from Ulrich Drepper's What Every Programmer Should Know About Memory [2]):

The standardized chipset looks like this:

Let's define some terms:

"All CPUs (two in the previous example, but there can be more) are connected via a common bus (the Front Side Bus, FSB) to the Northbridge.
"The Northbridge contains, among other things, the memory controller.
"The Southbridge often referred to as the I/O bridge, handles communication with devices through a variety of different buses".

The consequences are:

"All data communication from one CPU to another must travel over the same bus used to communicate with the Northbridge.
"All communication with RAM must pass through the Northbridge.
"Communication between a CPU and a device attached to the Southbridge is routed through the Northbridge."

This is where direct memory access (DMA) comes in:

"DMA allows devices, with the help of the Northbridge, to store and receive data in RAM directly without the intervention of the CPU (and its inherent performance cost)." [2]

So, all our IO seems unlikely to be the cause of our CPU problems (caveat: we need to do more testing).

Out of interest, I was reading up on DMA and stumbled on this from the classic Operating Systems [3]:

"The CPU programs the DMA controller by setting its registers so it knows what to transfer where. It also issues a command to the disk controller telling it to read date from the disk into its internal buffer and verify the checksum. When valid data are in the disk controller's buffer, DMA can begin."
"The DMA controller initiates the transfer by issuing a read request over the bus to the disk controller."
"The disk controller fetches the next word from its internal buffer... The write to memory is another standard bus cycle."
"When the write is complete, the disk controller sends an acknowledgement to the [DMA controller], also over the bus. The DMA controller then increments the memory address to use and decrements the byte count."

"If the byte count is still greater than 0, steps 2 through 4 are repeated until the count reaches 0. At this point the controller causes an interrupt. When the operating system starts up, it does not have to copy the block to memory; it is already there".

[1] Linux Kernel Development - Rob Love
[2] What Every Programmer Should Know About Memory - Ulrich Drepper.
[3] Operating Systems - Tanenbaum & Woodhull

Saturday, November 10, 2012

Points on Safepoints

I've been coming across safepoints for over a year without really knowing what they are (you can see safepoints mentioned in the native code JIT produced in an old post). Despite a lot of Googling, I've not been able to find much.

When Gil Tene, CTO and co-founder of Azul Systems, was in town over summer, he gave a lecture on garbage collection and mentioned safepoints. The notes I took were:

All commercial GCs use safepoints.
The GC reigns in all threads at safepoints. This is when it has exact knowledge of things touched by the threads.
They can also be used for non-GC activity like optimization.
A thread at a safepoint is not necessarily idle but it often is.
Safepoint opportunities should be frequent.
All threads need to reach a global safepoint typically every dozen or so instructions (for example, at the end of loops).

Note that safepoints can stop all threads.

These rough notes jotted down are unsatisfactory so I went looking at the code for safepoints in OpenJDK. I'm far from the best C/C++ programmer in the world (caveat emptor), but these are the interesting things I discovered:

From safepoint.cpp:

// Begin the process of bringing the system to a safepoint.
// Java threads can be in several different states and are
// stopped by different mechanisms:
//
// 1. Running interpreted
// The interpeter dispatch table is changed to force it to
// check for a safepoint condition between bytecodes.
// 2. Running in native code
// When returning from the native code, a Java thread must check
// the safepoint _state to see if we must block. If the
// VM thread sees a Java thread in native, it does
// not wait for this thread to block. The order of the memory
// writes and reads of both the safepoint state and the Java
// threads state is critical. In order to guarantee that the
// memory writes are serialized with respect to each other,
// the VM thread issues a memory barrier instruction
// (on MP systems). In order to avoid the overhead of issuing
// a memory barrier for each Java thread making native calls, each Java
// thread performs a write to a single memory page after changing
// the thread state. The VM thread performs a sequence of
// mprotect OS calls which forces all previous writes from all
// Java threads to be serialized. This is done in the
// os::serialize_thread_states() call. This has proven to be
// much more efficient than executing a membar instruction
// on every call to native code.
// 3. Running compiled Code
// Compiled code reads a global (Safepoint Polling) page that
// is set to fault if we are trying to get to a safepoint.
// 4. Blocked
// A thread which is blocked will not be allowed to return from the
// block condition until the safepoint operation is complete.
// 5. In VM or Transitioning between states
// If a Java thread is currently running in the VM or transitioning
// between states, the safepointing code will wait for the thread to
// block itself when it attempts transitions to a new state.
//

This point number 2 is particularly clever. The GC thread can protect some memory to which all threads in the process can write (using the mprotect system call) so they no longer can. Upon accessing this temporarily forbidden memory, a signal handler kicks in (see this previous post). Example code for this can be found on the man pages where it says:

"If the calling process tries to access memory in a manner that violates the protection, then the kernel generates a SIGSEGV signal for the process."

From what I can tell, this corresponds to the signal handling code in hotspot/src/os_cpu/linux_x86/vm/os_linux_x86.cpp:

// Check to see if we caught the safepoint code in the
// process of write protecting the memory serialization page.
// It write enables the page immediately after protecting it
// so we can just return to retry the write.
if ((sig == SIGSEGV) &&
os::is_memory_serialize_page(thread, (address) info->si_addr)) {
// Block current thread until the memory serialize page permission restored.
os::block_on_serialize_page_trap();
return true;
}

More safepoint notes to come.

SOA Far, SOA Good

My quibble with SOA patterns is that they tend to be common sense and hardly worth documenting.

A colleague was talking about the Federation Pattern and when I asked what it was, it turns out that it's something we've been happily doing for years.

Thomas Erl describes it in SOA Principles of Service Design as:

"A federated IT environment is one where resources and application are united while maintaing their individual autonomy and self-governance. SOA aims to increase a federated perspective of an enterprise to whatever extent it is applied. It accomplishes this through the widespread deployment of standardized and composable services each of which encapsulates a segment of the enterprise and expresses it in a consistent manner.

"In support of increasing federation, standardization becomes part of the up-front attention each service receives at design time. Ultimately this leads to an environment where enterprise-wide solution logic becomes naturally harmonized, regardless of the nature of its underlying implementation."

Erl hints that there are greater costs in the attention each federated service demands at design time. Most of the development time in an over-federated suite of applications is the tedious work of writing yet more mapping code and config that allows the interoperability.

What this threshold must be assessed on a case-by-case basis. As ever with architecture, your mileage my vary.

Wednesday, October 17, 2012

System Properties and the KeyManager

Setting fields using System.setProperty(...) is nasty as it's a global variable by any other name. But sometimes you have to tolerate it when dealing with other people's libraries.

One such library is Java's security classes where you can define your key store file by setting system property javax.net.ssl.keyStore. But be warned: you can only use it once. After that, it is set for the lifetime of the JVM.

If you take a peek into DefaultSSLContextImpl.getDefaultKeyManager(...), you'll see the static defaultKeyManagers being set (from a KeyManagerFactory initiated with a KeyStore that was loaded from the file in the system property). Subsequent calls to this method get the static reference.

This monkeyed around with our test suite no end as it was not immediately obviously where this was being set.

As it happened, an earlier test set it incorrectly but didn't exercise that path of the code. Later in the suite, we saw NoSuchAlgorithmException being thrown by an unrelated test. This is a misleading error.

"NoSuchAlgorithmExceptions are often cause by other underlying exceptions (file not found, wrong password, wrong keystore type, ...)." - from Bruno on StackOverflow.

And so it proved to be. The path the key store value was pointing to didn't exist.

[Miscellaneous: pictures of the SSL Handshake and the Wikipedia description are worth seeing.]

Server Keys

The KeyStore for a server can have many entries. The JavaDocs say:

Each entry in a keystore is identified by an "alias" string. In the case of private keys and their associated certificate chains, these strings distinguish among the different ways in which the entity may authenticate itself. For example, the entity may authenticate itself using different certificate authorities, or using different public key algorithms.

But which entry is passed back to the client? By stepping through the OpenJDK code, I came to this point:

Thread [main] (Suspended)
SunX509KeyManagerImpl.chooseServerAlias(String, Principal[], Socket) line: 271
ServerHandshaker.setupPrivateKeyAndChain(String) line: 1011
ServerHandshaker.trySetCipherSuite(CipherSuite) line: 879
SSLServerSocketImpl.checkEnabledSuites() line: 313
SSLServerSocketImpl.accept() line: 272
.
.

Where I saw this:

if ((aliases != null) && (aliases.length > 0)) {
return aliases[0];
}

The first alias is given to the client no matter what! And thus the private key on the ServerHandshaker object is set.

So, beware if you have other keys in your KeyStore as you might not be giving your client the right one.

Callooh callay! I have OpenJDK

I'm using OpenJDK because I want access to the JCE code to see why my system is behaving the way it is. Since the official JDK Reference Implementation for Java 7 is entirely OpenJDK [1], this seems reasonable.

Remembering I have OpenJDK on my Fedora VM (albeit 1.6), it was just a matter of downloading the source code like any other package. After doing so, I get this:

[phenry@localhost Code]$ rpm -qa | grep openjdk
java-1.6.0-openjdk-devel-1.6.0.0-65.1.10.8.fc15.i686
java-1.6.0-openjdk-1.6.0.0-65.1.10.8.fc15.i686
java-1.6.0-openjdk-src-1.6.0.0-65.1.10.8.fc15.i686

Hoorah! There it is. But where is the actual source file?

[phenry@localhost Code]$ rpm -ql java-1.6.0-openjdk-src-1.6.0.0-65.1.10.8.fc15.i686
/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0/src.zip
/usr/share/doc/java-1.6.0-openjdk-src-1.6.0.0
/usr/share/doc/java-1.6.0-openjdk-src-1.6.0.0/README.src

(Posting this because I keep forgetting all that RPM goodness [2]).

[1] Java Platform, Standard Edition 7 Reference Implementation
[2] A good crib sheet on RPM Commands.

Sunday, September 30, 2012

When buffering does not improve performance

We wrote a simple server earlier this month and were surprised how inefficient it was when under heavy load.

Following the Agile mantra of the simplistic thing that works, we used standard Java classes such as java.io.BufferedReader. To our surprise, this proved a very poor performer when compared to reading the raw InputStream. To illustrate, let's create an interface thus:

package com.henryp.io.stream;

import java.io.IOException;

public interface StreamReader {

public void readStream() throws IOException;

}

and have two implementing classes. The first is a simple implementation of BufferedReader:

package com.henryp.io.stream;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

public class BufferedStreamReader implements StreamReader {

private final BufferedReader bufferedReader;

BufferedStreamReader(InputStream inputStream) {
Reader reader = new InputStreamReader(inputStream);
bufferedReader = new BufferedReader(reader);
}

public void readStream() throws IOException {
String line = bufferedReader.readLine();
while (line != null) {
line = bufferedReader.readLine();
}
}

}

The second is a little more sophisticated. It doesn't allow the creation of Strings but instead uses byte arrays. What's more, it caches these byte arrays so it can reuse them:

package com.henryp.io.stream;

import java.io.IOException;
import java.io.InputStream;

public class FixedBufferStreamReader implements StreamReader {

private static final int BUFFER_SIZE = 8192;

private final ThreadLocal buffers = new ThreadLocal();

private final InputStream inputStream;

FixedBufferStreamReader(InputStream inputStream) {
super();
this.inputStream = inputStream;
}

@Override
public void readStream() throws IOException {
byte[] buffer = getBufferForThread();
int read = inputStream.read(buffer);
while (read != -1) {
read = inputStream.read(buffer);
}
}

private byte[] getBufferForThread() {
byte[] buffer = buffers.get();
if (buffer == null) {
buffer = new byte[BUFFER_SIZE];
buffers.set(buffer);
}
return buffer;
}

}

Note, neither implementation actually parses the data. Obviously, using raw byte arrays will be harder but if you want better performance, you'll have to take that hit.

The test harness looks like this:

package com.henryp.io.stream;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class StreamReaderStressTest {

private static final int CONTENT_SIZE = 10000;

private final int numThreads = 100;

private final int numIterations = 200;

private final AtomicInteger readerIndex = new AtomicInteger(-1);

private final StreamReader[] streamReaders;

private final CountDownLatch countDownLatch = new CountDownLatch(numThreads);

public static void main(String[] args) throws InterruptedException {
long initialMemory = memory();
System.out.println("Initializing...");
StreamReaderStressTest app = new StreamReaderStressTest();
long memoryBeforeRun = memory();
System.out.println("Sleeping...");
Thread.sleep(4000L);
app.runTest();
long memoryAfterRun = memory();
System.out.println("Memory consumed during setup: " + (initialMemory - memoryBeforeRun) + " bytes"
+ "\nChange in memory during test: " + (memoryBeforeRun - memoryAfterRun) + " bytes");
}

private static long memory() {
System.gc();
long freeMemory = Runtime.getRuntime().freeMemory();
System.out.println("Free memory: " + (freeMemory / (1024L*1024L)) + " MiB");
return freeMemory;
}

StreamReaderStressTest() {
InputStream[] inputStreams = initialize();
streamReaders = new StreamReader[inputStreams.length];
for (int i = 0 ; i < inputStreams.length ; i++) {
streamReaders[i] =
//new FixedBufferStreamReader(inputStreams[i]); // 13ms
new BufferedStreamReader(inputStreams[i]); // 367ms
}
}

private void runTest() {
System.out.println("About to start test...");
StreamReadingThread[] readers = new StreamReadingThread[numThreads];
Thread[] threads = new Thread[numThreads];
for (int i = 0 ; i < numThreads ; i++) {
readers[i] = new StreamReadingThread();
threads[i] = new Thread(readers[i]);
threads[i].start();
}
waitForThreadsToEnd(threads);
outputStats(readers);
}

private void outputStats(StreamReadingThread[] readers) {
long totalTime = 0;
for (int i = 0 ; i < readers.length ; i++) {
totalTime += readers[i].getDuration();
}
System.out.println("\nAverage time per thred " + (totalTime / (numThreads)) + "ms\n");
}

private void waitForThreadsToEnd(Thread[] threads) {
for (int i = 0 ; i < threads.length ; i++) {
try {
threads[i].join();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}

class StreamReadingThread implements Runnable {

private volatile long duration = 0;

@Override
public void run() {
getReady();
long start = System.currentTimeMillis();
try {
for (int i = 0 ; i < numIterations ; i++) {
streamReaders[readerIndex.incrementAndGet()].readStream();
}
duration = System.currentTimeMillis() - start;
} catch (IOException x) {
x.printStackTrace();
}
}

private void getReady() {
countDownLatch.countDown();
try {
countDownLatch.await();
} catch (InterruptedException e) {
e.printStackTrace();
}
}

long getDuration() {
return duration;
}
}

private InputStream[] initialize() {
int numStreams = numThreads * numIterations;
InputStream[] inputStreams = new InputStream[numStreams];
String stringForStream = createArbitraryString(CONTENT_SIZE);
for (int i = 0 ; i < numStreams ; i++) {
inputStreams[i] = new ByteArrayInputStream(stringForStream.getBytes());
}
return inputStreams;
}

private String createArbitraryString(int length) {
StringBuffer szb = new StringBuffer();
for (int i = 0 ; i < length ; i++) {
szb.append("A");
}
return szb.toString();
}

}

Note:

All initialization is done up-front.
A 4 second pause is implemented before the test so there is no interference from biased locking being turned off at start up (see here for a full description).
Run the JVM with the arguments -Xmx1024m -Xmn1024m so the memory is fully initialized to begin with.

Over 5 runs each, the use of BufferedReader yielded a mean time of 1813ms whereas using the raw InputStream took a mean time of 380ms.

A clue for why this is the case is in the source code for BufferedReader. There is a liberal sprinkling of System.arraycopy().

Furthermore, the amount of memory used by the BufferedReaders was about 4 times greater than the InputStreams (roughly 700MB versus 185MB). This is not surprising when you look at the code for BufferedReader. It creates byte arrays on instantiation.

Greater efficiency using the raw InputStream was achieved by using a smaller size for the byte array. Reducing it to 1024 bytes gave a mean time of 217ms (although with a greater standard deviation).

Looking at the source code for GlassFish's Grizzly project, you can see ByteBuffers being cached to read the data from a SocketChannel (see the method org.glassfish.grizzly.nio.transport.TCPNIOUtils.readSimpleByteBuffer(...) for more information. The test org.glassfish.grizzly.http.server.HttpInputStreamsTest.testCharacter010() executes this code).

[Aside: to make Grizzly into an Eclipse project, you need to run:

mvn org.apache.maven.plugins:maven-eclipse-plugin:2.6:eclipse

See here for why.]

Saturday, September 8, 2012

Garbage Collection Rubbish

More memory leaks recently and again from various byte-code manipulators. A memory dump showed huge swathes of memory taken by classes whose names indicate that they were dynamically generated by CGLIB.

The problem manifested itself during integration test runs where we constantly start and stop Jetty servers. Since we start and stop Jetty programatically (see a previous post) and the way we used Spring seemed to dynamically generate many classes, these classes ultimately consumed all the memory because they were then be stored in static references - that is, they are stored with the classes. And these classes were not garbage collected since Spring's classloader was the same as that running the test.

"Types loaded through the bootstrap class loader will always be reachable and will never be unloaded. Only types loaded through user-defined class loaders can become unreachable and can be unloaded by the virtual machine." - Inside the Java 2 Virtual Machine, Bill Venners, p266.

If we'd deployed a WAR file presumably all generated classes would have been garbage collected when the web app was unloaded in the test's tear down method.

To demonstrate our problem, let's:

write our own class loader
use it to load a class
instantiate an instance of that class and let it consume lots of memory by storing objects in one of its static references.
allow the classloader to be garbage collected
see how the memory changed during all these steps

OK, so here's the class loader:

package com.henryp.classloader;

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

public class MyClassLoader extends ClassLoader {

private final String baseDir;

MyClassLoader(String baseDir) {
super();
this.baseDir = baseDir;
}

@Override
protected Class findClass(String filename) throws ClassNotFoundException {
filename = filename.replace(".", "/") + ".class";
InputStream resourceAsStream = null;

try {
URL url = new URL(baseDir + filename);
System.out.println("URL = " + url);
resourceAsStream = url.openStream();
System.out.println("Resource null? " + (resourceAsStream == null));
byte[] buffer = new byte[4096];
byte[] bytes = new byte[4096];
int read = resourceAsStream.read(buffer);
int total = 0;

while (read != -1) { // this could be a lot more efficient but it does the job
System.out.println("Read " + read + " bytes");
byte[] dest = new byte[total + read];
System.arraycopy(bytes, 0, dest, 0, total);
System.arraycopy(buffer, 0, dest, total, read);
bytes = dest;
total += read;
read = resourceAsStream.read(buffer);
}
return defineClass(null, bytes, 0, total);
} catch (IOException e) {
e.printStackTrace();
throw new ClassNotFoundException(filename);
} finally {
try {
resourceAsStream.close();
} catch (Throwable e) {
e.printStackTrace();
}
}
}
}

Not awfully pretty but it doesn't need to be terribly efficient.

The main class that uses our class loader looks like this (notice that we measure the memory before and after our class loader is eligible for garbage collection):

package com.henryp.classloader;

public class Main {

public static void main(String[] args) {
Main app = new Main();
app.doClassloading();
}

private void doClassloading() {
MyClassLoader classLoader = new MyClassLoader("file:/Users/phenry/Documents/workspace/TestOther/bin/");
consumeMemory(classLoader);
long beforeNull = gcThenLogMemory();
System.out.println("Making classloader eligible for garbage collection");
classLoader = null;
long afterNull = gcThenLogMemory();
System.out.println("Memory usage change: " + (afterNull - beforeNull) + " bytes liberated");
}

private void consumeMemory(MyClassLoader classLoader) {
try {
// filename from a project that depends on this but this project does not depend on that
Class clazz = classLoader.findClass("com.henryp.classloader.OtherStaticReferenceBloater");
consumeMemory(clazz);
} catch (Exception e) {
e.printStackTrace();
System.exit(-1);
}
}

private static long gcThenLogMemory() {
System.gc();
long freeMemory = Runtime.getRuntime().freeMemory();
System.out.println("After garbage collection, free memory = " + freeMemory);
return freeMemory;
}

private void consumeMemory(Class clazz) throws Exception {
LargeMemoryConsumable memoryConsumer = instantiateMemoryConsumer(clazz);
long initialMemory = Runtime.getRuntime().freeMemory();
doCallOnForeignObject(memoryConsumer);
System.gc();
long finalMemory = Runtime.getRuntime().freeMemory();
System.out.println("Initial memory = " + initialMemory);
System.out.println("Final memory = " + finalMemory);
System.out.println("Total consumed = " + (initialMemory - finalMemory));
}

private LargeMemoryConsumable instantiateMemoryConsumer(Class clazz)
throws InstantiationException, IllegalAccessException {
System.out.println("Class name: " + clazz.getName());
gcThenLogMemory();
LargeMemoryConsumable memoryConsumer = (LargeMemoryConsumable) clazz.newInstance();
checkDifferentClassLoader(clazz);
return memoryConsumer;
}

private void doCallOnForeignObject(LargeMemoryConsumable memoryConsumer) throws Exception {
for (int i = 0 ; i < 500000 ; i++) {
memoryConsumer.consumeMemory();
}
}

private void checkDifferentClassLoader(Class clazz) {
if (clazz.getClassLoader() == this.getClass().getClassLoader()) {
System.out.println("Experiment useless if the classloaders are the same");
System.exit(-1);
}
}
}

And the memory consumer we refer to is a simple interface:

package com.henryp.classloader;

public interface LargeMemoryConsumable {
public void consumeMemory() throws Exception;
}

Now, this is where it gets a little more interesting. In another project, we have an object that just takes up memory. But note that although this new project depends on the classes in our first project, the first project does not reference this new project. It does, however, use a URL that points at the directory that happens to be the directory into which the classes of the new project are compiled.

(I've added a toString method to stop the compiler optimizing away the aString member.)

package com.henryp.classloader;

public class MyOtherFatObject {

public static Object create() {
return new MyOtherFatObject("" + System.currentTimeMillis());
}

private final String aString;

private MyOtherFatObject(String aString) {
super();
this.aString = aString;
}

@Override
public String toString() {
return "MyOtherFatObject [aString=" + aString + "]";
}

}

In the same project, we have a class that emulates what we saw Spring doing, that is having a static reference to a collection of memory-consuming objects.

package com.henryp.classloader;

import java.util.ArrayList;
import java.util.Collection;

public class OtherStaticReferenceBloater implements LargeMemoryConsumable {

private static final Collection

public void consumeMemory() throws Exception {
if (this.getClass().getClassLoader() != MyOtherFatObject.class.getClassLoader()) {
throw new Exception("Different classloaders");
}
objects.add(MyOtherFatObject.create());
}

public void addObject(Object object) {
objects.add(object);
}

}

Running with these JVM args:

-verbose:gc -Xmx64m -Xmn64m

the output looks like:

URL = file:/Users/phenry/Documents/workspace/TestOther/bin/com/henryp/classloader/OtherStaticReferenceBloater.class
Resource null? false
Read 1300 bytes
Class name: com.henryp.classloader.OtherStaticReferenceBloater
After garbage collection, free memory = 63333768
URL = file:/Users/phenry/Documents/workspace/TestOther/bin/com/henryp/classloader/MyOtherFatObject.class
Resource null? false
Read 896 bytes
Initial memory = 63333768
Final memory = 15130184
Total consumed = 48203584
After garbage collection, free memory = 15130200
Making classloader eligible for garbage collection
After garbage collection, free memory = 66266816
Memory usage change: 51136616 bytes liberated

(Notice how we explicitly load one class but it implicitly pulls in another using our class loader, the same class loader that loaded it - see the lines beginning with URL = ...).

Noting the usual caveats that System.gc() is a suggestion to the JVM and not a guarantee that the garbage collector should run, making the reference to the class loader null seems to liberate all the objects stored in the static reference of the class it loaded.

What we finally did was make all references we had in our integration tests to Spring's context loaders non-static. As our tests were garbage collected, memory was freed.