Wednesday, April 17, 2013

Going Atomic with Locks

You might have read about non-blocking algorithms, how the synchronized keyword is so 2012 and how things are much faster if you use atomics. Then, you might have written your own atomic-based locks and discovered that things are no better.

I certainly have. But thanks to this fine book I've learned not only how to make them faster but why they're faster.

Test-and-set Locks

Let's take a typical Java lock:


    private ReentrantLock lock = new ReentrantLock();

    @Override
    public void unlock() {
        lock.unlock();
    }

    @Override
    public void lock() {
        lock.lock();
    }

and let's have 100 threads competing for the lock to increment an int 50,000 times. Naively, I thought this would be efficient:



    private final AtomicBoolean state = new AtomicBoolean(false);

    public void lock() {
        // Spin, calling getAndSet() repeatedly, until we are the thread that
        // flipped the flag from false to true.
        while (state.getAndSet(true)) { }
    }

    public void unlock() {
        state.set(false);
    }


It was an order of magnitude slower. The questions are: why is it so much slower, and how do we improve it?
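
The benchmark itself isn't shown here, but a minimal sketch of the sort of harness described above might look like this (SimpleLock and all the names below are assumptions, not the original code):


import java.util.concurrent.CountDownLatch;

// Sketch of the benchmark: each thread repeatedly takes the lock under test
// and increments a shared int, e.g. run(lock, 100, 50000).
interface SimpleLock {
    void lock();
    void unlock();
}

class LockBenchmark {

    private int counter; // the shared int, protected by the lock under test

    long run(final SimpleLock lock, int numThreads, final int iterations) throws InterruptedException {
        final CountDownLatch finished = new CountDownLatch(numThreads);
        long start = System.currentTimeMillis();
        for (int i = 0; i < numThreads; i++) {
            new Thread(new Runnable() {
                @Override
                public void run() {
                    for (int j = 0; j < iterations; j++) {
                        lock.lock();
                        try {
                            counter++;
                        } finally {
                            lock.unlock();
                        }
                    }
                    finished.countDown();
                }
            }).start();
        }
        finished.await();
        return System.currentTimeMillis() - start; // elapsed wall-clock time in ms
    }
}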


Herlihy and Shavit [1] call this a TASLock (a test-and-set lock). They describe why it's so slow:

"For simplicity, we consider a typical multiprocessor architecture in which processors communicate by a shared broadcast medium called a bus (like a tiny Ethernet). Both the processors and the memory controller can broadcast on the bus, but only one processor (or memory) can broadcast on the bus at the a time. All processors (and memory) can listen. Today, bus-based architecture are common because they are easy to build, although they scale poorly to large numbers of processors.


"Each processor has a cache, a small high-speed memory where the processor keeps data likely to be of interest. A memory access typically requires orders of magnitude more machine cycles than a cache access... When a processor reads from an address in memory, it first checks whether that address and its contents are present in the cache. If so, then the processor has a cache hit, and can load the value immediately. If not, the the processor has a cache miss, and must find the data either in the memory or in another processor's cache. The processor then broadcasts the address on the bus. The other processors snoop on the bus. If one processor has that address in its cache, then it responds by broadcasting the address and value. If no processor has that address, then the memory responds with the value at theat address.

"Each getAndSet() call is broadcast on the bus. Because all threads must use the bus to communicate with memory, these getAndSet() calls delay all threads, even those not waiting for the lock. Even worse, the getAndSet() call forces other processors to discard their own cached copies of the lock, so every spinning thread encounters a cache miss almost every time, and must use the bus to fetch the new, but unchanged value. Adding insult to injury, when the thread holding the lock tries to release it, it may be delayed because the bus in monopolized by the spinners." [1]

Using the Linux perf command shows the number of bus cycles and cache-misses:


[henryp@corsair Performance]$ perf stat -e cycles,bus-cycles,cache-misses,L1-dcache-load-misses,L1-dcache-store-misses,LLC-load-misses,LLC-store-misses  java -cp bin/ com.henryp.test.concurrent.spin.LockMain

 Performance counter stats for 'java -cp bin/ com.henryp.test.concurrent.spin.LockMain':

   564,815,071,755 cycles                    #    0.000 GHz                     [57.17%]
    16,620,454,218 bus-cycles                                                   [57.18%]
           832,399 cache-misses                                                 [57.19%]
       421,992,129 L1-dcache-load-misses                                        [57.17%]
        14,457,236 L1-dcache-store-misses                                       [57.14%]
           247,642 LLC-load-misses                                              [57.11%]
           529,972 LLC-store-misses                                             [57.14%]

      10.681204987 seconds time elapsed



where LLC is the Last Level Cache (L3 on my machine). Misses in this cache mean a trip to RAM.

We'll compare these figures to the other strategies.

Test-and-test-and-set Locks

Herlihy and Shavit propose an improvement that looks something like this:


    private final AtomicBoolean state = new AtomicBoolean(false);

    @Override
    public void lock() {
        while (true) {
            // Spin on the read (which hits the local cache) and only attempt
            // the expensive getAndSet() once the lock looks free.
            while (state.get()) { }
            if (!state.getAndSet(true))
                return;
        }
    }

    @Override
    public void unlock() {
        state.set(false);
    }


It is somewhat better, but not by much (about 25%). The authors call this a TTASLock (a test-and-test-and-set lock):

"The first time thread B reads the lock it takes a cache miss, forcing B to block while the value is loaded into B's cache. As long as A holds the lock, B repeatedly rereads the value, but hits in the cache every time. B thus produces no bus traffic, and does not slow down other threads' memory accesses. Moreover, a thread that releases a lock is not delayed by threads spinning on that lock.

"The situation deteriorates, however, when the lock is released. The lock holder releases the lock by writing false to the lock variable, which immediately invalidates the spinners' cached copies. Each one takes a cache miss, rereads the new value, and they all (more-or-less simultaneously) call getAndSet() to acquire the lock. The first to succeed invalidates the others, who must then reread the value, causing a storm of bus traffic. Eventually, the threads settled down once again to local spinning.

"The notion of local spinning, where threads repeatedly reread cached values instead of repeatedly using the bus, is an important principle critical to the design of efficient spin locks." [1]

With this strategy, perf gives figures like:


   412,122,456,391 cycles                    #    0.000 GHz                     [57.19%]
    12,132,558,770 bus-cycles                                                   [57.15%]
           783,803 cache-misses                                                 [57.15%]
       128,832,596 L1-dcache-load-misses                                        [57.14%]
        11,947,461 L1-dcache-store-misses                                       [57.16%]
           256,900 LLC-load-misses                                              [57.18%]
           560,395 LLC-store-misses                                             [57.17%]

       7.813546393 seconds time elapsed


So, although the LLC figures are comparable, there is less bus activity and there are fewer L1 misses, just as Herlihy and Shavit suggest.

Back off!

The solution is to implement a back-off strategy (since vying for a highly contended lock may be a waste of time). The code the authors suggest looks a lot like the TTASLock, something like this:


    private final AtomicBoolean state = new AtomicBoolean(false);

    // Back-off state; the 1024ms cap is an arbitrary choice.
    private final Random random = new Random();
    private final int maxDelay = 1024;
    private volatile int limit = 1;

    @Override
    public void lock() {
        while (true) {
            while (state.get()) { backOff(); }
            if (!state.getAndSet(true))
                return;
        }
    }

    @Override
    public void unlock() {
        state.set(false);
    }

    private void backOff() {
        // Sleep for a random time, then double the upper bound for next time.
        int currentLimit = limit == 0 ? 1 : limit;
        int delay = random.nextInt(currentLimit);
        limit = Math.min(maxDelay, 2 * currentLimit);
        try {
            Thread.sleep(delay);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
        }
    }
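
One caveat about the sketch above: the back-off state (limit) is shared by every thread contending for the lock, which keeps the code short but means one thread's growing delay bound inflates everyone else's; keeping that state per thread (or per call to lock()) avoids that.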



This actually makes it faster than Java's built-in locks. The results look something like this:

JavaReentrantLock took 459 ms
TTASLockWithBackOff took 321 ms
TTASLock took 8932 ms
TASLock took 12600 ms

Repeatedly running the tests indicates that the TTASLock with back-off is consistently the fastest, by a margin of about 25%.

Perf gives figures for this strategy like:

       900,772,793 cycles                    #    0.000 GHz                     [62.65%]
        56,529,667 bus-cycles                                                   [64.05%]
           468,664 cache-misses                                                 [62.47%]
         9,728,368 L1-dcache-load-misses                                        [60.17%]
         6,039,764 L1-dcache-store-misses                                       [57.58%]
           173,197 LLC-load-misses                                              [57.86%]
           573,249 LLC-store-misses                                             [60.75%]
       0.465015535 seconds time elapsed

Now the differences are orders of magnitude. 

Quick Addendum

I initially tried to use Valgrind to analyse what was going on but saw very little difference between the figures (only about 4% fewer cache misses between the best and the worst locks). The documentation explains why:



"Valgrind serialises execution so that only one (kernel) thread is running at a time. This approach avoids the horrible implementation problems of implementing a truly multithreaded version of Valgrind, but it does mean that threaded apps never use more than one CPU simultaneously, even if you have a multiprocessor or multicore machine." [2]


[1] Herlihy, M. and Shavit, N., The Art of Multiprocessor Programming.

[2] http://valgrind.org/docs/manual/manual-core.html

Tuesday, April 9, 2013

Dumpster Diving in the JVM

If you're monitoring your application, you might notice the maximum amount of heap memory changing with time. How can this be? Surely the maximum is, well, the maximum and not a moving target?

The JavaDoc for MemoryUsage defines the maximum as:

"the maximum amount of memory (in bytes) that can be used for memory management. Its value may be undefined. The maximum amount of memory may change over time if defined. The amount of used and committed memory will always be less than or equal to max if max is defined. A memory allocation may fail if it attempts to increase the used memory such that used > committed even if used <= max would still be true (for example, when the system is low on virtual memory)."

In Linux, it's easy to reserve more memory than the machine physically has. For instance:


#include <stdio.h>
#include <sys/mman.h>
.
.
// The ULL suffix stops the constant overflowing as a 32-bit int.
size_t sillyAmountOfMemory = 1024ULL * 1024 * 1024 * 1024; // over 1 terabyte!
printf("About to map %zu bytes\n", sillyAmountOfMemory);
void *addr = mmap(NULL, sillyAmountOfMemory, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
.
.

Meanwhile, in a shell:

[phenry@localhost MyMemoryTestsC]$ ps aux | head -1 ; ps aux | grep MyMemoryTest
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
phenry   11905  0.0  0.0 1050452  236 pts/0    S    13:43   0:00 /home/phenry/workspaceGanymedeC/MyMemoryTestsC/Debug/MyMemoryTests

Notice the VSZ value - that's the virtual size of our application:

[phenry@localhost MyMemoryTestsC]$ man ps | grep -A2 VSZ
.
.
       vsz         VSZ       virtual memory size of the process in KiB (1024-byte units). Device mappings are currently excluded; this is subject to change.


This is more than the physical memory I have on this machine:


[phenry@localhost MyMemoryTestsC]$ cat /proc/meminfo | head -1
MemTotal:        1026144 kB


Back to our Java process. Depending on the garbage collector being used, the reported maximum heap size may change as the JVM resizes the generations to meet its performance goals. Oracle says:

"The statistics such as average pause time kept by the collector are updated at the end of each collection. The tests to determine if the goals have been met are then made and any needed adjustments to the size of a generation is made. The exception is that explicit garbage collections (e.g., calls to System.gc()) are ignored in terms of keeping statistics and making adjustments to the sizes of generations...

"If the maximum pause time goal is not being met, the size of only one generation is shrunk at a time. If the pause times of both generations are above the goal, the size of the generation with the larger pause time is shrunk first.

"If the throughput goal is not being met, the sizes of both generations are increased."

On a beefier machine, I can see the Parallel Scavenging collector being used. Oracle says of this type of collector:

"The parallel scavenge collector is similar to the parallel copying collector, and collects young generation garbage. The collector is targeted towards large young generation heaps and to scale with more CPUs. It works very well with large young generation heap sizes that are in gigabytes, like 12GB to 80GB or more, and scales very well with increase in CPUs, 8 CPUs or more. It is designed to maximize throughput in enterprise environments where plenty of memory and processing power is available.

"The parallel scavenge collector is again stop-the-world, and is designed to keep the pause down. The degree of parallelism can again be controlled. In addition, the collector has an adaptive tuning policy that can be turned on to optimize the collection. It balances the heap layout by resizing, Eden, Survivor spaces and old generation sizes to minimize the time spent in the collection. Since the heap layout is different for this collector, with large young generations, and smaller older generations, a new feature called "promotion undo" prevents old generation out-of-memory exceptions by allowing the parallel collector to finish the young generation collection."

(As an aside, there have been enhancements in JDK 7:

"The Parallel Scavenger garbage collector has been extended to take advantage of machines with NUMA (Non Uniform Memory Access) architecture. Most modern computers are based on NUMA architecture, in which it takes a different amount of time to access different parts of memory. Typically, every processor in the system has a local memory that provides low access latency and high bandwidth, and remote memory that is considerably slower to access.

"In the Java HotSpot Virtual Machine, the NUMA-aware allocator has been implemented to take advantage of such systems and provide automatic memory placement optimizations for Java applications. The allocator controls the eden space of the young generation of the heap, where most of the new objects are created. The allocator divides the space into regions each of which is placed in the memory of a specific node. The allocator relies on a hypothesis that a thread that allocates the object will be the most likely to use the object. To ensure the fastest access to the new object, the allocator places it in the region local to the allocating thread. The regions can be dynamically resized to reflect the allocation rate of the application threads running on different nodes. That makes it possible to increase performance even of single-threaded applications. In addition, "from" and "to" survivor spaces of the young generation, the old generation, and the permanent generation have page interleaving turned on for them. This ensures that all threads have equal access latencies to these spaces on average." [1])

Now, we connect to our Java process from another Java process and examine the first's MBeans with something like this:


        Map<String, Object> map = new HashMap<String, Object>();
        JMXConnector c = JMXConnectorFactory.newJMXConnector(createConnectionURL(host, port), map);
        c.connect();
        Object o = c.getMBeanServerConnection().getAttribute(new ObjectName("java.lang:type=Memory"), "HeapMemoryUsage");
        CompositeData cd = (CompositeData) o;

        Object max = cd.get("max"); // the reported maximum heap size, in bytes
.
.
.
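
createConnectionURL isn't shown above; assuming the standard JMX-over-RMI connector, it would be something along these lines:


    private JMXServiceURL createConnectionURL(String host, int port) throws MalformedURLException {
        // The stock JMX/RMI service URL.
        return new JMXServiceURL("service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi");
    }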


This is after we have changed the C++ code in hotspot/src/share/vm/services/management.cpp thus:

// Returns a java/lang/management/MemoryUsage object representing
// the memory usage for the heap or non-heap memory.
JVM_ENTRY(jobject, jmm_GetMemoryUsage(JNIEnv* env, jboolean heap))
.
.
.
        printf("PH: MemoryPool %s has %llu bytes\n", pool->name(), u.max_size());
.
.
        printf("PH: so the total is %llu bytes\n", total_max);


We see output from the JVM that looks like:

PH: MemoryPool PS Survivor Space has 65536 bytes
PH: PSGenerationPool::get_memory_usage: size 1431699456: 
PH: MemoryPool PS Old Gen has 14316601344 bytes
PH: so the total is 21473656832 bytes

when we hit it with the above MBean code. The JVM being monitored has the command line switch -Xmx20g.
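
The monitoring code itself is essentially the MBean snippet above run in a loop, something like this (the mb/kb formatting is elided):


        MBeanServerConnection connection = c.getMBeanServerConnection();
        ObjectName memory = new ObjectName("java.lang:type=Memory");
        while (true) {
            CompositeData usage = (CompositeData) connection.getAttribute(memory, "HeapMemoryUsage");
            System.out.println(new Date() + ": committed = " + usage.get("committed")
                    + ", max = " + usage.get("max"));
            Thread.sleep(1000); // the polling interval is an arbitrary choice
        }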

Now, when this monitored process starts consuming large amounts of memory, our Java code that is monitoring it prints out:

Tue Apr 09 22:59:42 BST 2013: committed = 18403360768 (17550 mb, 17972032kb), max = 19088801792 (18204 mb, 18641408kb)
Tue Apr 09 22:59:43 BST 2013: committed = 20726546432 (19766 mb, 20240768kb), max = 20726546432 (19766 mb, 20240768kb)

(notice how the maximum changes) and the output from the JVM being monitored indicates a change:

PH: PSYoungGen::resize_spaces
.
.
.
PH: MemoryPool PS Survivor Space has 1179648 bytes
PH: PSGenerationPool::get_memory_usage: size 1431699456: 
PH: MemoryPool PS Old Gen has 14316601344 bytes
PH: so the total is 21473722368 bytes

So the maximum is only the maximum until the next one :-)

Sunday, April 7, 2013

The Case of the Missing Memory

If you start Java with -Xmx512m, you'd expect the max heap size to be 512MiB, right? So why, when you connect via JConsole (or JMX), is the heap size smaller? My JConsole shows the committed and maximum size of the "Heap Memory Usage" as about 491MiB.
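
You don't even need JConsole to see this; Runtime should report much the same figure:


    // With -Xmx512m this prints noticeably less than 536870912 bytes,
    // for the reason worked out below.
    System.out.println(Runtime.getRuntime().maxMemory() + " bytes max heap");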

I've been fiddling with the OpenJDK source, putting log statements into it to see how the GC is behaving. So, for instance, I changed:

hotspot/src/share/vm/gc_implementation/parallelScavenge/parallelScavengeHeap.cpp

added:

#include <stdio.h>

and liberally sprinkled the code with statements like:


  printf("PH: ParallelScavengeHeap::initialize: size %d: \n", _reserved.byte_size());

And sure enough, when I run my application, I see the JVM spit out:



PH: ParallelScavengeHeap::initialize: size 536870912: 

So, it's using the collector that is the "default on certain machine types" [1] with the full 512MiB allocated to it. This collector is "targeted towards large young generation heaps and to scale with more CPUs. It works very well with large young generation heap sizes that are in gigabytes, like 12GB to 80GB or more, and scales very well with increase in CPUs, 8 CPUs or more" [2] (I have 16 CPUs and 64GB, although I have obviously not set my heap that high).

Now, when I connect to the JVM and look at the memory usage, I see statements like:


PH: EdenMutableSpacePool::get_memory_usage: size 134217728: 
PH: SurvivorMutableSpacePool::get_memory_usage: size 22347776: 
PH: PSGenerationPool::get_memory_usage: size 357957632: 


since I have changed these C++ classes that can all be found in

hotspot/src/share/vm/services/psMemoryPool.cpp

(Note that these three classes are all the subtypes of CollectedMemoryPool for the PS collector.)

These memory usages total the 491MiB we saw in JConsole but are still exactly 22,347,776 bytes short of the 512MiB we defined at start-up.

The "missing" memory is of course because there are two survivor spaces [3] but since only one has live objects at any point, we only count its size.

[1] Dr Richard Warburton's blog.
[2] Oracle's Improving Java Application Performance and Scalability by Reducing Garbage Collection Times and Sizing Memory Using JDK 1.4.1.
[3] Oracle's Virtual Machine Garbage Collection Tuning.