## Monday, January 27, 2014

### Sneaky Bit-Based Hack

I came across this bit of code in java.util.Random.nextInt and was impressed by its beauty:

if ((n & -n) == n)  // i.e., n is a power of 2

How does this work? Well, we have to look at how negative numbers are represented. Java uses two's complement.

import static java.lang.Integer.toBinaryString;
import static java.lang.String.format;
.
.
public static void main(String[] args) {
    for (int i = 0; i < 32; i++) {
        System.out.println(format("%1$4d ", i)  + formattedBinaryString(i));
        System.out.println(format("%1$4d ", -i) + formattedBinaryString(-i));
    }
}

private static String formattedBinaryString(int n) {
    String binaryString = toBinaryString(n);
    return format("%1$32s", binaryString);
}

gives an output like this (negative values truncated here to their last 8 bits):

   0        0
   0        0
   1        1
  -1 11111111
   2       10
  -2 11111110
   3       11
  -3 11111101
   4      100
  -4 11111100
   5      101
  -5 11111011
   6      110
  -6 11111010
   7      111
  -7 11111001
   8     1000
  -8 11111000
   9     1001
  -9 11110111
  10     1010
 -10 11110110
  11     1011
 -11 11110101
  12     1100
 -12 11110100
  13     1101
 -13 11110011
  14     1110
 -14 11110010
  15     1111
 -15 11110001
  16    10000
 -16 11110000

Now, there are many ways of working out the binary representation of a negative number in your head. Here are two:

1. Given that -1 is 11111111 (truncated), find the difference between -1 and your negative number, and take those 1s away. So, say you were thinking of -5: the difference is 4 (= -1 - (-5)), so take 100 away from the binary representation of -1 (i.e., clear the 3rd bit from the right) and you get 11111011. Just as above.

2. Moving from right to left along the binary representation of the positive number, find the first 1. Leave that alone and continue, but this time flipping 1s to 0s and 0s to 1s.

The upshot of this latter method is that only powers of 2 have a single bit set, and that bit is unchanged when converting to the negative number - which is exactly what (n & -n) == n tests.

(NB, for completeness, this algorithm does not work for 0. The negative of 0 in two's complement is just 0.)
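To make the trick concrete, here is a small sketch (the class and method names are mine): n & -n isolates the lowest set bit of n, so the expression equals n exactly when n has a single bit set.

```java
public class PowerOfTwo {

    // The check from java.util.Random.nextInt: true only when n has a single set bit.
    static boolean isPowerOfTwo(int n) {
        return n > 0 && (n & -n) == n;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 16; n++)
            if (isPowerOfTwo(n))
                System.out.print(n + " "); // prints 1 2 4 8 16
        System.out.println();

        // n & -n isolates the lowest set bit:
        System.out.println(12 & -12); // 4, since 01100 & 10100 == 00100
    }
}
```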

This reminds me of a bit trick a neck-bearded UNIX friend told me. How do you swap the values of integers i and j without using any other variable? The answer involves XOR:

int i = 17, j = 29;

i = j ^ i;
j = i ^ j;
i = j ^ i;

// now i = 29, j = 17
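One caveat, which is mine rather than my friend's: the trick destroys the value if the two "variables" are actually the same storage location (e.g. the same array element), because x ^ x == 0 for any x. A quick sketch:

```java
public class XorSwapCaveat {
    public static void main(String[] args) {
        int[] a = {17, 29};

        // swapping two distinct elements works as advertised:
        a[0] = a[1] ^ a[0];
        a[1] = a[0] ^ a[1];
        a[0] = a[1] ^ a[0];
        System.out.println(a[0] + ", " + a[1]); // 29, 17

        // but "swapping" an element with itself zeroes it:
        a[0] = a[0] ^ a[0]; // the first step already destroys the value
        System.out.println(a[0]); // 0
    }
}
```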

## Saturday, January 25, 2014

### What do you mean by memory?

If you set the JVM memory arguments -Xmx and -Xms, what are you actually doing? You're setting the heap size, right? So, if I run 4 Java processes using 5GB each on a machine with, say, 8GB, I've maxed out the memory, right? Well, not necessarily.

This was the output from top for a Linux box running a dozen or so JVMs with -Xmx4g:

top - 17:31:05 up 187 days, 12:24,  6 users,  load average: 1.75, 1.20, 0.88
Tasks: 293 total,   1 running, 292 sleeping,   0 stopped,   0 zombie
Cpu(s):  5.6%us,  0.5%sy,  0.0%ni, 93.7%id,  0.0%wa,  0.0%hi,  0.2%si,  0.0%st
Mem:     48274M total,    29096M used,    19178M free,      439M buffers
Swap:     8191M total,      990M used,     7201M free,     4889M cached

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
28448 mds       17   0 5325m 4.8g 6600 S    1 10.1 502:39.47 java
29368 mds       17   0 4718m 1.8g 5668 S    0  3.8  74:00.60 java
16996 mds       17   0 4711m 1.7g  15m S   10  3.5   1:40.33 java
15366 mds       17   0 4661m 1.5g  15m S    0  3.3   0:59.42 java
15365 mds       17   0 4654m 1.5g  15m S    0  3.2   0:54.01 java
16997 mds       17   0 4711m 1.5g  15m S    0  3.1   1:01.36 java
.

The RESident memory is less than 4GB for all but one process. The VIRTual memory is over 4GB for all of them. Which do we trust?

First, what is RESident and VIRTual memory?

"A virtual memory scheme splits memory used by each program into small, fixed-size units called pages... At any one time, only some of the pages of a program need to be resident in physical memory page frames; these pages form the so-called resident set. Copies of the unused pages of a program are maintained in the swap area - a reserved area of disk space used to supplement the computer's RAM - and loaded into physical memory only as required." [1]

But this machine is barely using swap space. So, the missing memory is not there.

"When your JVM starts it reserves your maximum heap size as one continuous block of virtual memory on startup. However, only the pages actually used become resident pages (actual main memory). When you set the minimum size, it doesn't force the JVM to use that much memory if it doesn't need it." [2]

"Address space (ie. virtual memory in the process list) doesn't cost anything; it's not real. What's real is the RSS (RES) column, which is resident memory." [3]

So, we're setting something that is "not real". What's the point? Well, we're setting logical memory limitations, not physical ones. When a garbage collection is triggered, and how the different areas of the heap (eden, old generation etc.) are sized, are both functions of this parameter.
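You can see the reserved-versus-used distinction from inside the JVM too. As a rough sketch (the class name is mine): Runtime.maxMemory() reflects the -Xmx reservation, while totalMemory() is the heap actually committed, which is what shows up as resident memory.

```java
public class HeapSizes {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024 * 1024;
        // roughly the -Xmx reservation: virtual address space, not necessarily resident
        System.out.println("max   = " + rt.maxMemory() / mb + "M");
        // the committed heap: this is what contributes to the RES column in top
        System.out.println("total = " + rt.totalMemory() / mb + "M");
        // committed but currently unused
        System.out.println("free  = " + rt.freeMemory() / mb + "M");
    }
}
```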

This all became relevant for us this week because one of our boxes went wild and started killing Java processes. Note: it is not necessarily the offender who is punished. The Linux OOM Killer [4] chooses the best process to kill using an algorithm that takes into account a number of variables (you can see who is most likely to die, as the score is shown in the file /proc/PID/oom_score).

[1] The Linux Programming Interface, Kerrisk
[2] StackOverflow.
[3] ServerFault.
[4] Blog on the Linux OOM Killer.

### Java's sizeOf()

Java does have a C-like sizeOf(). Sort of. It lives in the Instrumentation interface and is called getObjectSize. Note: it gives the shallow size, but there are ways of getting the deep size of your object (see below).

You have to register your class as an agent, which is a schlepp but not too hard. First, create a JAR that declares your class as its agent. In Maven, you can do this by adding the following to your pom:

<build>
  <plugins>
    <plugin>
      <artifactId>maven-jar-plugin</artifactId>
      <version>2.4</version>
      <configuration>
        <archive>
          <index>true</index>
          <manifest>
          </manifest>
          <manifestEntries>
            <Premain-Class>YOUR_CLASS_WITH_A_PREMAIN_METHOD</Premain-Class>
          </manifestEntries>
        </archive>
      </configuration>
    </plugin>
.
.

The class must have a static premain method that may look like this:

public static void premain(String agentArgs, Instrumentation inst) {
    System.out.println("premain");
    instrumentation = inst;
}

static Instrumentation instrumentation;

You then run your JVM with:

-javaagent:YOUR_JAR_FILENAME

and your premain method will run before your main. Note, you can have multiple -javaagent arguments on your command line if you want more than one agent.

The size of objects is JVM-dependent. For instance, we've upgraded to 64-bit JVMs and started getting out-of-memory exceptions. This was not too surprising, as our references are now 64-bit, not 32. Running this code:

System.out.println("new Hashmap(1) " + instrumentation.getObjectSize(new HashMap(1)));

outputs on a 64-bit JVM

new Hashmap(1) 80

and on a 32-bit JVM:

new Hashmap(1) 56

You can get the deep size of objects by using the SizeOf library, which just traverses the object graph and adds up the sizes. You need to add the JAR of the SizeOf library as an agent to use it.
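For illustration, here is a naive sketch of how such a deep sizeOf can work: walk the object graph reflectively, counting each distinct object once. Everything here (the class name, the ToLongFunction parameter) is my own; in a real agent the shallow-size function would be instrumentation::getObjectSize.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;
import java.util.function.ToLongFunction;

public class DeepSize {

    // Sum the shallow sizes of every distinct object reachable from obj.
    public static long deepSizeOf(Object obj, ToLongFunction<Object> shallowSize) {
        return walk(obj, shallowSize, Collections.newSetFromMap(new IdentityHashMap<Object, Boolean>()));
    }

    private static long walk(Object obj, ToLongFunction<Object> shallowSize, Set<Object> seen) {
        if (obj == null || !seen.add(obj)) return 0; // count each object once; handles cycles
        long size = shallowSize.applyAsLong(obj);
        Class<?> c = obj.getClass();
        if (c.isArray()) {
            if (!c.getComponentType().isPrimitive())
                for (Object element : (Object[]) obj)
                    size += walk(element, shallowSize, seen);
            return size;
        }
        for (; c != null; c = c.getSuperclass()) {
            for (Field f : c.getDeclaredFields()) {
                if (Modifier.isStatic(f.getModifiers()) || f.getType().isPrimitive()) continue;
                try {
                    f.setAccessible(true);
                    size += walk(f.get(obj), shallowSize, seen);
                } catch (ReflectiveOperationException | RuntimeException e) {
                    // fields of JDK-internal classes may be inaccessible: skip them
                }
            }
        }
        return size;
    }
}
```

With the agent in place you would call something like deepSizeOf(map, instrumentation::getObjectSize); the SizeOf library does essentially this walk, with more care over shared constants.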

## Sunday, January 19, 2014

### Unicast, broadcast and multicast

These paradigms correspond to three types of sockets in Java. ServerSocket is perhaps the most commonly used one, as most people are somewhat familiar with TCP ("Indeed, TCP works with only unicast addresses, although UDP and raw IP support other paradigms" [1]). But there are also DatagramSockets and MulticastSockets.

First, a quick overview [1]:

Unicasting is where a process is talking to exactly one other process.

Broadcasting sends a datagram that all hosts on the attached subnet receive. The disadvantage is that every host on the subnet must process the datagram ... even if the host is not participating in the application.

A multicast datagram is sent to some hosts. Datagrams should be received only by those interfaces on hosts running applications wishing to participate in the multicast group.

(TCP is an example of a unicast protocol. ARP and DHCP use broadcasting or multicasting [1])

Sockets corresponding to DatagramSocket look like this in netstat:

mds@gbl04224[dev]:~> netstat -nap 2>/dev/null | grep 8888
udp        0      0 :::8888                 :::*                                29525/java

or

mds@gbl04224[dev]:~> netstat -nap 2>/dev/null | grep 8888
udp        0      0 0.0.0.0:8888            0.0.0.0:*                           32331/java

if they're using IPv4.

This is what you'd see if you ran:

DatagramSocket datagramSocket = new DatagramSocket(null);

using  -Djava.net.preferIPv6Stack=true and -Djava.net.preferIPv4Stack=true respectively.

Multicast sockets look the same. That is, if you run:

MulticastSocket multicastSocket = new MulticastSocket(6789);
multicastSocket.joinGroup(group);

it produces the same kind of output from netstat, although note: multiple processes can use the same multicast socket:

mds@gbl04224[dev]:~> netstat -nap 2>/dev/null | grep 6789
udp        0      0 :::6789                 :::*                                23912/java
udp        0      0 :::6789                 :::*                                23824/java

but not the same datagram socket:

mds@gbl04224[dev]:/opt/mds/storage/Phill/Java> java -cp . com.phenry.io.Multicast
at java.net.PlainDatagramSocketImpl.bind0(Native Method)
at java.net.PlainDatagramSocketImpl.bind(PlainDatagramSocketImpl.java:82)
at java.net.DatagramSocket.bind(DatagramSocket.java:368)
at com.phenry.io.Multicast.bindDatagramSocket(Multicast.java:21)
at com.phenry.io.Multicast.main(Multicast.java:14)
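The difference comes down to SO_REUSEADDR: the MulticastSocket constructor enables it before binding, whereas a plain DatagramSocket does not. A small sketch (letting the OS pick the ports, unlike the fixed ports above):

```java
import java.net.DatagramSocket;
import java.net.MulticastSocket;
import java.net.SocketException;

public class BindDemo {
    public static void main(String[] args) throws Exception {
        // a plain DatagramSocket does not set SO_REUSEADDR, so a second bind fails
        DatagramSocket datagram = new DatagramSocket(0); // port 0: the OS picks a free port
        try {
            new DatagramSocket(datagram.getLocalPort());
            System.out.println("second datagram bind succeeded (unexpected)");
        } catch (SocketException e) {
            System.out.println("second datagram bind failed: " + e.getMessage());
        } finally {
            datagram.close();
        }

        // MulticastSocket sets SO_REUSEADDR before binding, so two sockets can share a port
        MulticastSocket first = new MulticastSocket(0);
        MulticastSocket second = new MulticastSocket(first.getLocalPort());
        System.out.println("two multicast sockets bound to port " + first.getLocalPort());
        first.close();
        second.close();
    }
}
```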

Run netstat with the -g flag to see group membership. Linux also tells you how many listeners there are. For instance, running the multicast code above yields:

$ netstat -g
IPv6/IPv4 Group Memberships
Interface       RefCnt Group
--------------- ------ ---------------------
lo              1      all-systems.mcast.net
eth0            1      228.5.6.7

Running it a second time with the first process still running gives:

$ netstat -g
IPv6/IPv4 Group Memberships
Interface       RefCnt Group
--------------- ------ ---------------------
lo              1      all-systems.mcast.net
eth0            2      228.5.6.7

[1] Unix Network Programming, Stevens et al

### TCP can be slow

This has led to new protocols being invented. Oracle Coherence uses TCMP for data transmission (unlike Hazelcast, where "communication among cluster members is always TCP/IP with Java NIO beauty" [1]). There are several implementations of UDT, a UDP-based protocol (see, for example, Netty's use of the native libraries of Barchart-UDT in io.netty.testsuite.transport.udt.UDTClientServerConnectionTest).

ACK Knowledge

Why is TCP so slow? Well, first there is the overhead. Using tcpdump on the command line, I watched the communication between two boxes connected on the same LAN during a performance test. Here is the slightly edited output:

> sudo tcpdump -nn host 192.168.1.94 and 192.168.1.91 -i p7p1
.
.
21:59:46.105835 IP [CLIENT] > [SERVER] Flags [S], seq 3624779548, win 8192, options [mss 1460,nop,wscale 0,nop,nop,TS val 415343776 ecr 0,sackOK,eol], length 0
21:59:46.105842 IP [SERVER] > [CLIENT]: Flags [S.], seq 4258144914, ack 3624779549, win 2896, options [mss 1460,sackOK,TS val 8505455 ecr 415343776,nop,wscale 0], length 0
21:59:46.113288 IP [CLIENT] > [SERVER] Flags [.], ack 1, win 8688, options [nop,nop,TS val 415343783 ecr 8505455], length 0
21:59:46.113554 IP [CLIENT] > [SERVER] Flags [P.], seq 1:1025, ack 1, win 8688, options [nop,nop,TS val 415343783 ecr 8505455], length 1024
21:59:46.113559 IP [SERVER] > [CLIENT]: Flags [.], ack 1025, win 2896, options [nop,nop,TS val 8505463 ecr 415343783], length 0
21:59:46.113625 IP [CLIENT] > [SERVER] Flags [.], seq 1025:2473, ack 1, win 8688, options [nop,nop,TS val 415343783 ecr 8505455], length 1448
21:59:46.113843 IP [SERVER] > [CLIENT]: Flags [.], ack 2473, win 2896, options [nop,nop,TS val 8505463 ecr 415343783], length 0
21:59:46.120443 IP [CLIENT] > [SERVER] Flags [.], seq 2473:3921, ack 1, win 8688, options [nop,nop,TS val 415343790 ecr 8505463], length 1448
21:59:46.120695 IP [CLIENT] > [SERVER] Flags [.], seq 3921:5369, ack 1, win 8688, options [nop,nop,TS val 415343791 ecr 8505463], length 1448
21:59:46.120710 IP [SERVER] > [CLIENT]: Flags [.], ack 5369, win 2896, options [nop,nop,TS val 8505470 ecr 415343790], length 0
21:59:46.127322 IP [CLIENT] > [SERVER] Flags [.], seq 5369:6817, ack 1, win 8688, options [nop,nop,TS val 415343797 ecr 8505470], length 1448
21:59:46.127370 IP [CLIENT] > [SERVER] Flags [.], seq 6817:8265, ack 1, win 8688, options [nop,nop,TS val 415343797 ecr 8505470], length 1448
21:59:46.127588 IP [SERVER] > [CLIENT]: Flags [.], ack 8265, win 2896, options [nop,nop,TS val 8505477 ecr 415343797], length 0
21:59:46.132814 IP [CLIENT] > [SERVER] Flags [.], seq 8265:9713, ack 1, win 8688, options [nop,nop,TS val 415343804 ecr 8505477], length 1448
21:59:46.132817 IP [CLIENT] > [SERVER] Flags [P.], seq 9713:10011, ack 1, win 8688, options [nop,nop,TS val 415343804 ecr 8505477], length 298
21:59:46.133006 IP [SERVER] > [CLIENT]: Flags [.], ack 10011, win 2896, options [nop,nop,TS val 8505482 ecr 415343804], length 0
21:59:46.133170 IP [SERVER] > [CLIENT]: Flags [P.], seq 1:3, ack 10011, win 2896, options [nop,nop,TS val 8505483 ecr 415343804], length 2
21:59:46.133177 IP [SERVER] > [CLIENT]: Flags [R.], seq 3, ack 10011, win 2896, options [nop,nop,TS val 8505483 ecr 415343804], length 0
21:59:46.139801 IP [CLIENT] > [SERVER] Flags [.], ack 3, win 8686, options [nop,nop,TS val 415343809 ecr 8505483], length 0

The server-to-client communication (the [SERVER] > [CLIENT] lines) mostly consists of acknowledgements.

"The peer TCP must acknowledge the data, and as the ACKs arrive from the peer, only then can our TCP discard the acknowledged data from the socket send buffer. TCP must keep a copy of our data until it is acknowledged by the peer" [2].

Handshakes

If you take a look at the first three lines of that TCP dump, you'll see the 3-way handshake taking about 7 ms - that's about 20% of the overall call time. So, if you're connecting each time you talk to your server, you might want to keep the connection open and pool them.

There are moves afoot to exploit these packets to carry application data [4].

Slow starters

This in itself may not be enough. If a connection has been idle for long enough, something called Slow-Start Restart [3] may kick in, depending on your kernel settings.

First, what is Slow-Start?

"The only way to estimate the available capacity between the client and the server is to measure it by exchanging data, and this is precisely what slow-start is designed to do. To start, the server initializes a new congestion window (cwnd) ... The cwnd variable is not advertised or exchanged between the sender and receiver... Further, a new rule is introduced: the maximum amount of data in flight (not ACKed) between the client and the server is the minimum of the rwnd and cwnd variables... [We aim] to start slow and to grow the window size as the packets are acknowledged: slow-start!" [3]

The initial value of cwnd on my system is given by:

[henryp@corsair Blogs]$ grep -A 2 initcwnd `find /usr/src/kernels/3.6.10-2.fc17.x86_64/include  -type f -iname '*h'`
/usr/src/kernels/3.6.10-2.fc17.x86_64/include/net/tcp.h:/* TCP initial congestion window as per draft-hkchu-tcpm-initcwnd-01 */
/usr/src/kernels/3.6.10-2.fc17.x86_64/include/net/tcp.h-#define TCP_INIT_CWND 10
/usr/src/kernels/3.6.10-2.fc17.x86_64/include/net/tcp.h-
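To see why this matters, note that during slow-start cwnd grows by one segment for every segment ACKed, which doubles it each round trip. A toy illustration (the numbers are mine, not measured):

```java
public class SlowStartGrowth {
    public static void main(String[] args) {
        int cwnd = 10; // segments; cf. TCP_INIT_CWND above
        for (int rtt = 0; rtt <= 4; rtt++) {
            System.out.println("after " + rtt + " round trips: cwnd = " + cwnd + " segments");
            cwnd *= 2; // each segment ACKed adds one segment, doubling cwnd per RTT
        }
    }
}
```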

Cache Connections as a Solution

Perhaps the obvious response is to cache the connections on the client side. But beware of slow-start restart "which resets the congestion window of a connection after it has been idle for a defined period of time". [3]

On a Linux box, this can be checked with:

sysctl net.ipv4.tcp_slow_start_after_idle

where an output of 1 indicates that this functionality is turned on. It is recommended that it be turned off if you want your client to cache connections [3].

Dropped Packets

Don't worry about dropped packets in TCP - they are essential. "In fact, packet loss is necessary to get the best performance from TCP! " [3]

[1] Hazelcast Community Edition.
[2] Unix Network Programming, p58, Stevens et al.
[3] High Performance Browser Networking - O'Reilly.
[4] TCP Fast Open: expediting web services - lwn.net
[5] Spy on Yourself with tcpdump

## Saturday, January 11, 2014

### Slow serialization between Java 6 and 7

When we upgraded all our systems to use Java 7 in the performance test environment, everything worked fine. But if the client remained on Java 6 and the Jetty server on Java 7, the test harness slowed by an order of magnitude.

What gives? Nobody else seemed to be complaining about it when I did a Google search.

Well, with a combination of jstack and YourKit, we saw the problem was the bespoke classloader that the test harness used. It was synchronized on the loadClass method and had a lot of threads contending to access it. However, it was exactly the same classloader in both Java 6 and Java 7 configurations. Huh?

Indirectly, the problem was due to the classes that are new in Java 7. Putting some logging in our classloader showed that, when running Java 6, it was constantly trying to load ReflectiveOperationException (and failing). This class is new in JDK 7 and has been retrofitted as the superclass of common JDK classes such as ClassNotFoundException.

Our bespoke classloader allows hot deployment of new classes so if a class is not found, it tries to load it from somewhere on the network. But even if it doesn't find it, serialization continues as normal - just very slowly.

This is because Java serialization doesn't really care that classes are missing. It does a best-effort attempt to populate the object anyway.

The serialized bytes of the object still carry metadata, something I missed when I ran this:

public static final String OBJ_SER = "String.ser";

public static void main(String[] args) throws IOException {
    FileOutputStream fileOutputStream = new FileOutputStream(OBJ_SER);
    ObjectOutputStream objectOutputStream = new ObjectOutputStream(fileOutputStream);
    objectOutputStream.writeObject("this is a test");
    objectOutputStream.flush();
    objectOutputStream.close(); // closes the underlying FileOutputStream too
}

This produced a file of a mere 21 bytes - most of it my text "this is a test". Replacing the line that writes the object to a stream with this:

objectOutputStream.writeObject(new ClassNotFoundException("this is a test"));

produced a monster 803 bytes. This is because java.lang.String is treated specially: "There are special representations for null objects, new objects, classes, arrays, strings." [1]. ClassNotFoundException, by contrast, is not special and serializes to something like this:

----sr--java-lang-ClassNotFoundException-Z-f-------L--ext--Ljava-lang-Throwable-xr--java-lang-ReflectiveOperationException-----------xr--java-lang-Exception-----------xr--java-lang-Throwable--5-9w-----L--causeq----L--detailMessaget--Ljava-lang-String---
stackTracet---Ljava-lang-StackTraceElement-L--suppressedExceptionst--Ljava-util-List-xppt--this-is-a-testur---Ljava-lang-StackTraceElement--F-----9---xp----sr--java-lang-StackTraceElementa----6-----I-
lineNumberL--declaringClassq----L--fileNameq----L-
methodNameq----xp----t--com-henryp-SerializationMaint--SerializationMain-javat--mainsr--java-util-Collections-UnmodifiableList---1-------L--listq----xr--java-util-Collections-UnmodifiableCollection-B---------L--ct--Ljava-util-Collection-xpsr--java-util-ArrayListx-----a----I--sizexp----w-----xq----xp

Here you can see it carries metadata describing the classes that make up its hierarchy.
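A convenient way to compare these sizes without writing files is to serialize into a byte array; the helper below is mine, not from the original experiment:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

public class SerializedSize {

    // Serialize an object with standard Java serialization and return the byte count.
    static int serializedSizeOf(Object o) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(o);
        out.close();
        return bytes.size();
    }

    public static void main(String[] args) throws IOException {
        // 4-byte stream header + 1-byte string tag + 2-byte length + 14 bytes of text = 21
        System.out.println(serializedSizeOf("this is a test")); // 21
        // the exception drags in its class hierarchy and stack trace: hundreds of bytes
        System.out.println(serializedSizeOf(new ClassNotFoundException("this is a test")));
    }
}
```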

This makes you wonder how efficient Java serialization is, since all that metadata is sent with each object. There are alternatives: the open-source Kryo being one, the proprietary POF format used in Oracle's Coherence being another (this blog [2] compares serialization libraries). With Coherence, you have to assign arbitrary but unique IDs to each class to be serialized and provide a class to perform the serialization, such that your config may look something like this:

<user-type>
  <type-id>2050</type-id>
  <class-name>com.xx.yy.zz</class-name>
  <serializer>
    <class-name>something that implements com.tangosol.io.pof.PofSerializer</class-name>
  </serializer>
</user-type>

This can be laborious but allows very efficient de/serialization.