Sunday, February 13, 2011

It's the environment, stupid!

During our recent release, an upgraded version of a library pulled in MINA as a new transitive dependency. Although there were no problems in our Linux QA environment, the Solaris QA environment just wouldn't work. There were no error messages, just our application reporting that it was constantly trying to connect to another tier.

When the exact same code works in one environment but not in another, the problem simply has to be environmental. But proving it to your colleagues (who think the issue is with MINA) can be hard, and finding the exact problem is even harder.

If it's environmental, then we have to step away from Java (you can never live totally inside the virtual machine...) and look at what the actual machine is doing.

Let's write a test:
package com.henryp.nio;

import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;


public class SocketNioTest {

    private static final int SERVER_PORT = 9999;
    private static ServerSocket serverSocket;

    @BeforeClass
    public static void setUp() throws IOException {
        serverSocket = new ServerSocket(SERVER_PORT);
        Thread thread = new Thread(new Runnable() {

            public void run() {
                while (!serverSocket.isClosed()) {
                    try {
                        Socket accepted = serverSocket.accept(); // note: we never read from it
                        System.out.println("client connected: " + accepted);
                    } catch (IOException e) {
                        e.printStackTrace(); // expected when tearDown closes the server socket
                    }
                }
            }

        });
        thread.setDaemon(true);
        thread.start(); // without this, nothing would ever accept our connections
    }

    @AfterClass
    public static void tearDown() throws IOException {
        serverSocket.close();
    }

    @Test
    public void testNio() throws IOException, InterruptedException {
        SocketChannel socketChannel = SocketChannel.open();

        System.out.println("getLocalHost...");
        InetAddress inetAddress = InetAddress.getLocalHost();
        System.out.println("binding...");
        SocketAddress bindpoint = new InetSocketAddress(inetAddress, SERVER_PORT);
        System.out.println("connecting...");
        socketChannel.socket().connect(bindpoint);

        sendRandomData(socketChannel);

        System.out.println("closing...");
        socketChannel.socket().close();
    }

    private void sendRandomData(SocketChannel socketChannel) throws IOException {
        ByteBuffer byteBuffer = ByteBuffer.allocate(256); // 256 bytes of zeros: position 0, limit 256
        socketChannel.write(byteBuffer);
    }

}

OK, it's not a real unit test (as it can't fail) but JUnit makes a nice test harness. Let's put some breakpoints in and run tcpdump.

[root@vmwareFedoraII henryp]# tcpdump -i lo port 9999

Note that you need to be root to do this. Also note that you need to monitor the loopback interface. This is a very useful feature of Linux, as you can then monitor packets that are passed around within the same machine (here, even within the same JVM) - packets that never actually reach the network card.

Now this line:

socketChannel.socket().connect(bindpoint);

causes tcpdump to print:

21:09:59.414367 IP localhost.localdomain.44985 > localhost.localdomain.distinct: Flags [S], seq 625354929, win 32792, options [mss 16396,sackOK,TS val 22409432 ecr 0,nop,wscale 6], length 0

21:09:59.414409 IP localhost.localdomain.distinct > localhost.localdomain.44985: Flags [S.], seq 630583436, ack 625354930, win 32768, options [mss 16396,sackOK,TS val 22409432 ecr 22409432,nop,wscale 6], length 0

21:09:59.414420 IP localhost.localdomain.44985 > localhost.localdomain.distinct: Flags [.], ack 1, win 513, options [nop,nop,TS val 22409432 ecr 22409432], length 0


This is the famous three-way handshake (SYN, SYN/ACK, ACK).

  1. Our client (which happens to be using port 44985 - an ephemeral port chosen at random) sends a SYN packet (Flags [S]) to the server socket with sequence number 625354929.
  2. The server responds with a SYN/ACK (Flags [S.] ... ack) with sequence number 630583436, acknowledging 625354930 (the client's sequence number plus 1).
  3. The client then acknowledges receipt of this with an ACK (ack).

This is a healthy, normal exchange. Note that it is produced by just one line of Java code. Even if you single-step into the Java source, you'll see all this chatter take place in a single native call.
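Incidentally, if you would rather watch the handshake complete asynchronously than have it disappear inside one blocking native call, a non-blocking connect does the trick. This is just a sketch of mine (not part of the test above, though you could drop it into the same class): connect() returns immediately and finishConnect() only reports true once the kernel has finished the SYN, SYN/ACK, ACK exchange.

@Test
public void testNonBlockingConnect() throws IOException, InterruptedException {
    SocketChannel channel = SocketChannel.open();
    channel.configureBlocking(false); // connect() now returns immediately instead of blocking
    boolean connected = channel.connect(
            new InetSocketAddress(InetAddress.getLocalHost(), SERVER_PORT));
    while (!connected && !channel.finishConnect()) { // poll until the kernel has completed the handshake
        Thread.sleep(10); // put a breakpoint here and watch tcpdump
    }
    channel.close();
}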

Now, we send some random data and it is this line:

socketChannel.write(byteBuffer);

that produces this output in tcpdump:


21:12:21.380171 IP localhost.localdomain.44985 > localhost.localdomain.distinct: Flags [P.], seq 1:257, ack 1, win 513, options [nop,nop,TS val 22551398 ecr 22409432], length 256

21:12:21.380220 IP localhost.localdomain.distinct > localhost.localdomain.44985: Flags [.], ack 257, win 529, options [nop,nop,TS val 22551398 ecr 22551398], length 0


This is the client sending a PUSH packet (Flags [P.]) of 256 bytes (remember our ByteBuffer.allocate(256)) and the server acknowledging receipt (ack).

seq 1:257 refers to the range of sequence numbers covered by this packet (relative to the initial sequence number, 625354929). If we drop the stack frame and run the lines in this method again, we see:

21:16:02.165005 IP localhost.localdomain.44985 > localhost.localdomain.distinct: Flags [P.], seq 257:513, ack 1, win 513, options [nop,nop,TS val 22772183 ecr 22551398], length 256

21:16:02.165014 IP localhost.localdomain.distinct > localhost.localdomain.44985: Flags [.], ack 513, win 546, options [nop,nop,TS val 22772183 ecr 22772183], length 0


and the sequence number for the client has moved to seq 257:513.
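As an aside, sendRandomData actually sends 256 bytes of zeros: a freshly allocated ByteBuffer has position 0 and limit 256, so write() drains the whole (empty) buffer. The more usual pattern - just a sketch of mine, not what the test above does, and the greeting string is arbitrary - is to put your data into the buffer and flip() it before writing:

private void sendGreeting(SocketChannel socketChannel) throws IOException {
    ByteBuffer buffer = ByteBuffer.allocate(256);
    buffer.put("hello, server".getBytes()); // position advances past the data we put in
    buffer.flip();                          // limit = old position, position = 0: ready to drain
    while (buffer.hasRemaining()) {         // write() is not guaranteed to send everything in one call
        socketChannel.write(buffer);
    }
}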

And finally:

socketChannel.socket().close();

produces:

21:21:48.188898 IP localhost.localdomain.44985 > localhost.localdomain.distinct: Flags [F.], seq 513, ack 1, win 513, options [nop,nop,TS val 23118207 ecr 22772183], length 0

21:21:48.228956 IP localhost.localdomain.distinct > localhost.localdomain.44985: Flags [.], ack 514, win 546, options [nop,nop,TS val 23118247 ecr 23118207], length 0


where the client says it's FINished and the server ACKnowledges this.

Actually, this is not the entire story. Look what happens when the JVM closes down:

21:26:26.173044 IP localhost.localdomain.distinct > localhost.localdomain.44985: Flags [R.], seq 1, ack 514, win 546, options [nop,nop,TS val 23396191 ecr 23118207], length 0

This is an RST packet and, interestingly, it comes from the server side of the connection (port 9999), not the client. It says: whatever data is currently being processed in the TCP/IP stack, forget it. When the JVM exits, the OS closes the accepted socket, which still has unread data sitting in its receive queue, so the connection is aborted with an RST rather than finished with a FIN; the other end need not worry about this socket any more. Apparently, firewalls use these types of packets to stop people from connecting.

So, FINishing the TCP/IP conversation is not sufficient to fully close the connection. In fact, if we run the test again, pausing after we have closed the socket but before we shut down the JVM, we see something like this in netstat:


[henryp@vmwareFedoraII ~]$ netstat -nap | grep 9999
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp 0 0 :::9999 :::* LISTEN 8099/java
tcp 257 0 ::ffff:127.0.0.1:9999 ::ffff:127.0.0.1:52088 CLOSE_WAIT -
tcp 0 0 ::ffff:127.0.0.1:52088 ::ffff:127.0.0.1:9999 FIN_WAIT2 -


And this state can stick around for some time (typically of the order of tens of seconds, depending on the OS). Note that these connections are no longer associated with the Java process at this point (netstat shows no PID against them); they appear to be totally under the control of the OS.
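As an aside, you don't have to kill the JVM to see an RST. Setting SO_LINGER with a linger time of zero before closing tells the stack to abort the connection rather than FINish it politely. This sketch (reusing the set-up from the test above) should make tcpdump show Flags [R.] as soon as close() is called:

SocketChannel socketChannel = SocketChannel.open();
socketChannel.socket().connect(new InetSocketAddress(InetAddress.getLocalHost(), SERVER_PORT));
socketChannel.socket().setSoLinger(true, 0); // linger time of zero: close() aborts with RST rather than FIN
socketChannel.socket().close();              // tcpdump should now show Flags [R.] straight away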

With RST, there is no waiting. This brings me to what we were seeing in our QA environment: the RST was being sent while the JVM was still alive and during the three-way handshake (SYN, SYN/ACK, RST... WTF??). This implies that the problem could not possibly be in Java code, since this is far too low a level.

It's easy to blame the new NIO/MINA code, as that was the only new thing in our stack, but since Linux was handling it fine we asked the system administrators to look at this Solaris box and see what could be wrong. Finally, by comparing the environment with that of root (who could run our code fine), we found that the chroot-ed user under which our code runs could not write to /dev/poll. I don't pretend to be a Solaris expert, but looking at this link:

"The /dev/poll pseudo-device [is] maintaining the full set of file descriptors being polled in the kernel."

That sounds a little like (OK, I'm clutching at straws) the opening page of the chapter on Selectors in Java NIO:

"multiplexing make it possible for a single thread to efficiently manage many I/O channels simultaneously. C/C++ coders have had the POSIX select( ) and/or poll( ) system calls in their toolbox for many years."

Now, I know my MINA code is using selectors, so I am assuming that this asks the operating system to add an entry to /dev/poll. Since the thread finds this unexpectedly read-only (a mistake by the admin setting up the chroot jail), things go horribly wrong and an RST is sent just as we're connecting.
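For the record, the selector plumbing in question looks something like this. It is a minimal sketch of my own, not MINA's actual code (you would also need java.nio.channels.Selector and SelectionKey imports and an enclosing method that throws IOException). Selector.open() and the subsequent register/select calls are, I am assuming, what end up talking to /dev/poll on Solaris:

Selector selector = Selector.open();                  // this is where the OS-level poll machinery is created
SocketChannel channel = SocketChannel.open();
channel.configureBlocking(false);                     // channels must be non-blocking to be registered
channel.connect(new InetSocketAddress("localhost", 9999));
channel.register(selector, SelectionKey.OP_CONNECT);  // ask the selector to watch this channel
if (selector.select(1000) > 0) {                      // blocks until a channel is ready or the timeout expires
    for (SelectionKey key : selector.selectedKeys()) {
        if (key.isConnectable()) {
            ((SocketChannel) key.channel()).finishConnect();
        }
    }
    selector.selectedKeys().clear();
}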

Whether this is true or not, chmoding /dev/poll to be read/write for this user solved the problem.

Poisoned Ivy

Stress tests are not just for your code. They're for the full stack.

We recently updated our build to use Ivy. This is no small piece of work and (as we discovered) can be prone to error.

Since we work in an investment bank, we're encouraged to use internal Ivy repositories. The developer tasked with this work could not find the JBossCache library we were using (1.2.3 - yes, the code really is that old) and so used version 1.2.2. He regarded this downgrade as the lesser of two evils, since transitive dependency management seemed more important than a small difference in the revision number of a library. What could possibly go wrong?

We passed the QA process and deployed to production. A few days later, the app became unresponsive and kept puking this error:

org.jboss.cache.lock.TimeoutException: write lock for /tranchebook could not be acquired after 10000 ms

Why had this cut of the code passed QA?

I ran a soak test and within half an hour saw the same problem. So, I changed the version of JBossCache back to 1.2.3 (which is what we had been using for the last few years). After an hour, I still had not seen this exception. Cue a hasty, overnight deployment to production.

Looking at the JBossCache code, I saw the problem. If you put a breakpoint in org.jboss.cache.interceptors.UnlockInterceptor's invoke(MethodCall m) method, at this line:

List locks=(List)lock_table.get(Thread.currentThread());


With jboss-cache-1.2.3.jar, you’ll see that lock_table is a Collections$SynchronizedMap.

With jboss-cache-1.2.2.jar, you’ll see that lock_table is a plain HashMap, which is bad since neither this method nor the methods calling it hold a lock.

It’s easy to see that if multiple threads are changing a HashMap, we’re going to get non-deterministic behaviour (from the JavaDocs: "Note that this implementation is not synchronized. If multiple threads access this map concurrently, and at least one of the threads modifies the map structurally, it must be synchronized externally.").
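You can see this non-determinism for yourself with a few lines of code. This is purely an illustrative sketch (nothing to do with JBossCache itself): ten threads hammering a plain HashMap will typically lose entries (and, on an unlucky run with an old JDK, can even corrupt a bucket chain and spin forever), while the Collections.synchronizedMap wrapper behaves.

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class UnsafeMapDemo {
    public static void main(String[] args) throws InterruptedException {
        Map<Integer, Integer> unsafe = new HashMap<Integer, Integer>();
        Map<Integer, Integer> safe = Collections.synchronizedMap(new HashMap<Integer, Integer>());
        hammer(unsafe);
        hammer(safe);
        // The synchronized map reliably ends up with 100,000 entries; the plain HashMap usually doesn't.
        System.out.println("unsafe size = " + unsafe.size() + ", safe size = " + safe.size());
    }

    private static void hammer(final Map<Integer, Integer> map) throws InterruptedException {
        Thread[] threads = new Thread[10];
        for (int t = 0; t < threads.length; t++) {
            final int offset = t * 10000;
            threads[t] = new Thread(new Runnable() {
                public void run() {
                    for (int i = 0; i < 10000; i++) {
                        map.put(offset + i, i); // structural modification with no synchronization
                    }
                }
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
    }
}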


Version Number Methodology

Version numbers are less arbitrary than most developers imagine. Pressman says:

"Configuration object 1.0 undergoes revision and becomes object 1.1. Minor corrections and changes result in versions 1.1.1 and 1.1.2, which is followed by a major update that is object 1.2. The evolution of object 1.0 continues through 1.3 and 1.4, but at the same time, a major modification to the object results in a new evolutionary path, version 2.0."

(Software Engineering: A Practitioner's Approach, 4th Edition, p233.)

The exact convention is moot, but a popular one is:

Major.Minor.Revision.Build

This is what Maven uses [1], is similar to what GNU uses [2], and (apart from the use of an underscore) is what Sun uses:

phillip:~ henryp$ java -version
java version "1.6.0_22"


Cleverly, Sun let their marketing people use the minor number in "soft" literature - hence we often call JDK 1.5 "Java 5.0" and so on. Apparently, they did this for Solaris too.

The beauty of this system is that it helps me with compatibility. For instance, when moving from a wildcard-based classpath to an Ivy-managed classpath, we noticed that Xalan 2.7.1 and Xalan 2.1.0 were both being pulled in as transitive dependencies of other libraries we're using. Since we're not using OSGi, we can't use both, and as it happened Xalan 2.1.0 was appearing first on the classpath. This was causing a problem since another library needed methods only present in Xalan 2.7.1. What to do?

Since both Xalans were 2.x, we could be fairly sure that something written for version 2.1.0 would work with version 2.7.1, since they share the same major number. Note: the opposite is not true - there is no guarantee that something written for 2.7.1 will work when running against 2.1.0. So, we removed 2.1.0 by hand, confident that 2.7.1 would serve both libraries. So far so good.

This is similar to the JDK convention. Java compiled with JDK 1.0 will play nicely with JDK 1.6. Similarly, I don't expect massive differences between JDK 1.6.0_22 and 1.6.0_5 other than bug fixes and performance improvements (see my previous post). Presumably, the JDK will only move to version 2.x.y_z when deprecated methods like Thread.stop are removed, as this constitutes an API change [3].

[1] http://mojo.codehaus.org/versions-maven-plugin/version-rules.html
[2] http://www.advogato.org/article/40.html
[3] http://www.javabeat.net/articles/101-introduction-to-java-module-system-in-java-70-2.html