## Saturday, November 24, 2012

### Algorithm of the Week

In a week of cool algorithms, this one in java.util.PriorityQueue was the most interesting.

If you step through the code adding and removing elements, you'll see the array held in PriorityQueue.queue changing, but its order is not the order in which elements come out of the queue. The documentation says the "Priority queue [is] represented as a balanced binary heap".

What is a binary heap?

"Definition: A binary heap is a collection of keys arranged in a complete heap-ordered binary tree, represented in level order in an array (not using the first entry).

"In a binary heap, the keys are stored in an array such that each key is guaranteed to be larger than (or equal to) the keys at two other specific positions. In turn, each of those keys must be larger than (or equal to) two more keys, and so forth.

"Definition: A binary tree is heap-ordered if the key in each node is larger than or equal to the keys in that node's two children (if any). Equivalently, the key in each node of a heap-ordered binary tree is smaller than or equal to the key in that node's parent (if any). Moving up from any node, we get a nondecreasing sequence of keys; moving down from any node, we get a nonincreasing sequence of keys. In particular... the largest key in a heap-ordered binary tree is found at the root." - Algorithms, Sedgewick & Wayne.

So, a tree that looks like this:

[Graphic taken from the Wikipedia entry]

maps isomorphically to an array, in this case 100, 19, 36, 17, 3, 25, 1, 2, 7. The rule is (from comments in PriorityQueue):

the two children of queue[n] are queue[2*n+1] and queue[2*(n+1)]

And "if multiple elements are tied for least value, the head is one of those elements -- ties are broken arbitrarily."

One last thing: the relative order of a node's two children is irrelevant; only the parent-child ordering matters.
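The behaviour is easy to check with a small sketch (note one difference from the tree above: PriorityQueue is a min-heap, so the smallest key sits at the root, whereas the Wikipedia graphic shows a max-heap with 100 at the root):

```java
import java.util.PriorityQueue;

public class HeapDemo {
    // Drain the queue, returning elements in the order poll() yields them.
    static String drain(PriorityQueue<Integer> pq) {
        StringBuilder sb = new StringBuilder();
        while (!pq.isEmpty()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(pq.poll());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        PriorityQueue<Integer> pq = new PriorityQueue<>();
        // Insertion order is arbitrary; internally the array stays
        // heap-ordered (children of queue[n] are queue[2*n+1] and
        // queue[2*(n+1)]), not sorted.
        for (int n : new int[] {7, 2, 25, 19, 17, 1, 36, 100, 3}) {
            pq.add(n);
        }
        System.out.println(drain(pq)); // prints "1 2 3 7 17 19 25 36 100"
    }
}
```

Whatever the internal array looks like mid-stream, poll() always removes the current root, so the elements emerge in sorted order.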

## Sunday, November 18, 2012

### Intrinsic Methods

If you copy and paste some of the JDK library code into your own class, you might be surprised that it doesn't run as fast as it does in its original home.

This surprising phenomenon is due to some methods being intrinsic methods. Oracle's Performance Tuning wiki page is somewhat quiet on the subject but it does point in the direction of library_call.cpp and vmSymbols.hpp.

Hotspot does this by swapping the Java code out and replacing it with hand-written code optimized for a particular architecture. Michael Barker shows on his blog how java.lang.Integer.bitCount() is replaced with an Intel ("Nehalem and later") instruction, popcnt (this appears to be defined in jdk7/hotspot/src/cpu/x86/vm/x86_*.ad).
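To make the copy-and-paste point concrete, here is a sketch. The method below is the well-known SWAR population count (the same algorithm as the pure-Java Integer.bitCount); the results are identical, but only the call to Integer.bitCount itself is a candidate for the popcnt intrinsic, while a pasted copy is compiled as ordinary Java:

```java
public class BitCountDemo {
    // The classic SWAR population count. Correct, but Hotspot will not
    // intrinsify this copy; only Integer.bitCount gets the popcnt treatment.
    static int myBitCount(int i) {
        i = i - ((i >>> 1) & 0x55555555);
        i = (i & 0x33333333) + ((i >>> 2) & 0x33333333);
        i = (i + (i >>> 4)) & 0x0f0f0f0f;
        i = i + (i >>> 8);
        i = i + (i >>> 16);
        return i & 0x3f;
    }

    public static void main(String[] args) {
        for (int n : new int[] {0, 1, 0xFF, 0xDEADBEEF, -1}) {
            // Same answer either way; only the intrinsic differs in speed.
            System.out.println(Integer.bitCount(n) + " == " + myBitCount(n));
        }
    }
}
```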

The classes that are mentioned as having intrinsics can be found with:

$ grep do_intrinsic vmSymbols.hpp | perl -pe 's/#.*//g' | perl -pe 's/\/\/.*//g' | awk '{ print $2 }' | sort | uniq | perl -pe 's/_/./g' | perl -pe 's/,//g'

java.lang.Boolean
java.lang.Byte
java.lang.Character
java.lang.Class
java.lang.Double
java.lang.Float
java.lang.Integer
java.lang.invoke.InvokeDynamic
java.lang.invoke.MethodHandle
java.lang.invoke.MethodHandleNatives
java.lang.Long
java.lang.Math
java.lang.Object
java.lang.reflect.Array
java.lang.reflect.Method
java.lang.ref.Reference
java.lang.Short
java.lang.String
java.lang.StringBuffer
java.lang.StringBuilder
java.lang.System
java.lang.Throwable
java.nio.Buffer
java.util.Arrays
sun.misc.AtomicLongCSImpl
sun.misc.Unsafe
sun.reflect.Reflection

[Aside: not all of them actually lead to intrinsics. I was looking at the intrinsic for java.nio.Buffer.checkIndex(...) but saw this in library_call.cpp:

case vmIntrinsics::_checkIndex:
// We do not intrinsify this.  The optimizer does fine with it.
return NULL;
]

But the java.lang.Math class does seem well represented in x86_64.ad where, for instance, the sin function is mapped to an Intel opcode rather than library calls.

### Direct Memory Access

We're currently profiling a Java application that features Lucene. There's some debate as to what is going on. Is our app performing badly because it is IO-bound? This would make sense as we are writing a lot of indexes to disk. So, why is the CPU usage so high?

Would a lot of IO trash the CPU? Kernel writer, Robert Love, suggests not:

"Given that the processors can be orders of magnitude faster than the hardware they talk to, it is not ideal for the kernel to issue a request and wait for a response from the significantly slower hardware. Instead, because the hardware is comparatively slow to respond, the kernel must be free to go and handle other work, dealing with the hardware only after that hardware has actually completed its work." [1]

In most (non-embedded) architectures, it appears that the CPU has very little to do with the heavy lifting of data. What goes on is this (with a lot of explanation stolen from Ulrich Drepper's What Every Programmer Should Know About Memory [2]):

The standardized chipset looks like this:

Let's define some terms:

• "All CPUs (two in the previous example, but there can be more) are connected via a common bus (the Front Side Bus, FSB) to the Northbridge.
• "The Northbridge contains, among other things, the memory controller.
• "The Southbridge, often referred to as the I/O bridge, handles communication with devices through a variety of different buses".

The consequences are:

• "All data communication from one CPU to another must travel over the same bus used to communicate with the Northbridge.
• "All communication with RAM must pass through the Northbridge.
• "Communication between a CPU and a device attached to the Southbridge is routed through the Northbridge."

This is where direct memory access (DMA) comes in:

"DMA allows devices, with the help of the Northbridge, to store and receive data in RAM directly without the intervention of the CPU (and its inherent performance cost)." [2]

So, all our IO seems unlikely to be the cause of our CPU problems (caveat: we need to do more testing).

Out of interest, I was reading up on DMA and stumbled on this from the classic Operating Systems [3]:

1. "The CPU programs the DMA controller by setting its registers so it knows what to transfer where. It also issues a command to the disk controller telling it to read data from the disk into its internal buffer and verify the checksum. When valid data are in the disk controller's buffer, DMA can begin."
2. "The DMA controller initiates the transfer by issuing a read request over the bus to the disk controller."
3. "The disk controller fetches the next word from its internal buffer... The write to memory is another standard bus cycle."
4. "When the write is complete, the disk controller sends an acknowledgement to the [DMA controller], also over the bus. The DMA controller then increments the memory address to use and decrements the byte count."

"If the byte count is still greater than 0, steps 2 through 4 are repeated until the count reaches 0. At this point the controller causes an interrupt. When the operating system starts up, it does not have to copy the block to memory; it is already there".
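The four numbered steps can be sketched as a toy model (purely illustrative; the names and word-at-a-time transfer are my assumptions, not real controller code):

```java
// A toy model of the DMA word-copy loop described above: the CPU programs
// the controller once, then the controller moves the data itself and only
// interrupts when the byte count reaches zero.
public class DmaDemo {
    static int[] memory = new int[8];      // stand-in for RAM
    static boolean interruptRaised = false;

    // Step 1: the CPU sets the controller's registers (address, count);
    // after that the transfer runs without further CPU involvement.
    static void dmaTransfer(int[] diskBuffer, int memoryAddress, int count) {
        while (count > 0) {
            // Steps 2-3: the controller requests the next word from the
            // disk controller's buffer and writes it to memory.
            memory[memoryAddress] = diskBuffer[diskBuffer.length - count];
            // Step 4: acknowledge, increment the address, decrement the count.
            memoryAddress++;
            count--;
        }
        // Count reached zero: raise the interrupt. The block is already in
        // memory, so the OS has nothing left to copy.
        interruptRaised = true;
    }

    public static void main(String[] args) {
        dmaTransfer(new int[] {10, 20, 30}, 2, 3);
        System.out.println(memory[2] + " " + memory[3] + " " + memory[4]
                + " interrupt=" + interruptRaised);
        // prints "10 20 30 interrupt=true"
    }
}
```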

[1] Linux Kernel Development - Robert Love
[2] What Every Programmer Should Know About Memory - Ulrich Drepper
[3] Operating Systems - Tanenbaum & Woodhull

## Saturday, November 10, 2012

### Points on Safepoints

I've been coming across safepoints for over a year without really knowing what they are (you can see safepoints mentioned in the native code JIT produced in an old post). Despite a lot of Googling, I've not been able to find much.

When Gil Tene, CTO and co-founder of Azul Systems, was in town over summer, he gave a lecture on garbage collection and mentioned safepoints. The notes I took were:

• All commercial GCs use safepoints.
• The GC reins in all threads at safepoints. This is when it has exact knowledge of things touched by the threads.
• They can also be used for non-GC activity like optimization.
• A thread at a safepoint is not necessarily idle but it often is.
• Safepoint opportunities should be frequent.
• All threads need to reach a global safepoint typically every dozen or so instructions (for example, at the end of loops).

Note that safepoints can stop all threads.

These rough notes jotted down are unsatisfactory so I went looking at the code for safepoints in OpenJDK. I'm far from the best C/C++ programmer in the world (caveat emptor), but these are the interesting things I discovered:

From safepoint.cpp:

// Begin the process of bringing the system to a safepoint.
// Java threads can be in several different states and are
// stopped by different mechanisms:
//
//  1. Running interpreted
//     The interpreter dispatch table is changed to force it to
//     check for a safepoint condition between bytecodes.
//  2. Running in native code
//     When returning from the native code, a Java thread must check
//     the safepoint _state to see if we must block.  If the
//     VM thread sees a Java thread in native, it does
//     not wait for this thread to block.  The order of the memory
//     writes and reads of both the safepoint state and the Java
//     threads state is critical.  In order to guarantee that the
//     memory writes are serialized with respect to each other,
//     the VM thread issues a memory barrier instruction
//     (on MP systems).  In order to avoid the overhead of issuing
//     a memory barrier for each Java thread making native calls, each Java
//     thread performs a write to a single memory page after changing
//     the thread state.  The VM thread performs a sequence of
//     mprotect OS calls which forces all previous writes from all
//     Java threads to be serialized.  This is done in the
//     os::serialize_thread_states() call.  This has proven to be
//     much more efficient than executing a membar instruction
//     on every call to native code.
//  3. Running compiled Code
//     Compiled code reads a global (Safepoint Polling) page that
//     is set to fault if we are trying to get to a safepoint.
//  4. Blocked
//     A thread which is blocked will not be allowed to return from the
//     block condition until the safepoint operation is complete.
//  5. In VM or Transitioning between states
//     If a Java thread is currently running in the VM or transitioning
//     between states, the safepointing code will wait for the thread to
//     block itself when it attempts transitions to a new state.
//

Point number 2 is particularly clever. The VM thread uses the mprotect system call to write-protect the memory page that every Java thread writes to after changing state, so those writes are suddenly forbidden. Upon accessing this temporarily forbidden memory, a signal handler kicks in (see this previous post). Example code for this can be found on the man pages where it says:

"If the calling process tries to access memory in a manner that violates the protection, then the kernel generates a SIGSEGV signal for the process."
From what I can tell, this corresponds to the signal handling code in hotspot/src/os_cpu/linux_x86/vm/os_linux_x86.cpp:

// Check to see if we caught the safepoint code in the
// process of write protecting the memory serialization page.
// It write enables the page immediately after protecting it
// so we can just return to retry the write.
if ((sig == SIGSEGV) &&
    os::is_memory_serialize_page(thread, (address) info->si_addr)) {
  // Block current thread until the memory serialize page permission restored.
  os::block_on_serialize_page_trap();
  return true;
}

More safepoint notes to come.

### SOA Far, SOA Good

My quibble with SOA patterns is that they tend to be common sense and hardly worth documenting.

A colleague was talking about the Federation Pattern and when I asked what it was, it turns out that it's something we've been happily doing for years.

Thomas Erl describes it in SOA Principles of Service Design as:

"A federated IT environment is one where resources and applications are united while maintaining their individual autonomy and self-governance. SOA aims to increase a federated perspective of an enterprise to whatever extent it is applied. It accomplishes this through the widespread deployment of standardized and composable services each of which encapsulates a segment of the enterprise and expresses it in a consistent manner.

"In support of increasing federation, standardization becomes part of the up-front attention each service receives at design time. Ultimately this leads to an environment where enterprise-wide solution logic becomes naturally harmonized, regardless of the nature of its underlying implementation."

Erl hints that there are greater costs in the attention each federated service demands at design time. Most of the development time in an over-federated suite of applications is the tedious work of writing yet more mapping code and config that allows the interoperability.

Where this threshold lies must be assessed on a case-by-case basis. As ever with architecture, your mileage may vary.