
Sunday, February 2, 2014

TCP Chatter

The tcpdump command lies. If you capture traffic with it, you'll see packets larger than your MTU (maximum transmission unit - you can check this by running the ifconfig command). This is because the capture is taken at the operating system level and "the operating system is passing packets larger than MTU to the network adapter, and the network adapter driver is breaking them up so that they fit within the MTU." [1]

This is called TCP Large Segment Offload (TSO or LSO). It's an optimization, but it makes it harder to see what is really happening on the wire. You can turn it off using the ethtool command [2].

I captured the conversations between clients uploading 100MB of data to a server running JStringServer (a work in progress), focused on one particular conversation (identified by the ports chosen in the exchange) and cleaned the data with sed so that it looks like this:

34.565659  SERVER > CLIENT
34.606133  SERVER > CLIENT
34.606143  SERVER > CLIENT
.
.

(where the number is the time in seconds) for the server-to-client and

34.562933  CLIENT > SERVER
34.600927  CLIENT > SERVER
34.603579  CLIENT > SERVER
.
.

for the client-to-server. This lets me generate a histogram of packets per time interval.

Using a little bit of the R-language:

server2Client <- read.table("~/Documents/Temp/43645_server2Client.txt", header=FALSE, sep=" ")
client2Server <- read.table("~/Documents/Temp/43645_client2Server.txt", header=FALSE, sep=" ")

numBreaks = 20
hist(c(client2Server$V1), breaks= numBreaks, col=rgb(1,0,0,0.5), xlab="time", main="Packets")

hist(c(server2Client$V1), breaks= numBreaks, col=rgb(0,0,1,0.5), xlab="time", main="Packets", add=TRUE)

yields this:

Packet Histogram. Red is client-to-server. Blue is server-to-client.
Red shows the packets from the client uploading its data to the server; blue shows the packets from the server back to the client (note: these are almost entirely ACK packets). Purple is where the two are superimposed. Note that there is only one time interval in which the number of server-to-client packets exceeds the client-to-server count.

The upshot of all this is that TCP is terribly chatty. The server returns nothing but the string "OK" when the exchange is complete, but that doesn't stop it sending lots of packets back to the client ACKnowledging receipt of each chunk of data. That's a lot of bandwidth spent on packets carrying no application data.

Further Reading

Interesting paper on TCP burstiness here.

[1] Wireshark docs.
[2] Segmentation and Checksum Offloading - Steven Gordon.

Sunday, January 19, 2014

Further Adventures in TCP

TCP can be slow. This has led to new protocols being invented. Oracle Coherence uses TCMP for data transmission (unlike Hazelcast, where "communication among cluster members is always TCP/IP with Java NIO beauty" [1]). There are several implementations of UDT, a UDP-based protocol, such as the ones found here (see Netty's use of the native libraries of Barchart-UDT in io.netty.testsuite.transport.udt.UDTClientServerConnectionTest).

ACK Knowledge

Why is TCP so slow? Well, first there is the overhead. Using tcpdump on the command line, I watched the communication between two boxes connected on the same LAN during a performance test. Here is the slightly edited output:

> sudo tcpdump -nn host 192.168.1.94 and 192.168.1.91 -i p7p1
.
.
21:59:46.105835 IP [CLIENT] > [SERVER] Flags [S], seq 3624779548, win 8192, options [mss 1460,nop,wscale 0,nop,nop,TS val 415343776 ecr 0,sackOK,eol], length 0
21:59:46.105842 IP [SERVER] > [CLIENT]: Flags [S.], seq 4258144914, ack 3624779549, win 2896, options [mss 1460,sackOK,TS val 8505455 ecr 415343776,nop,wscale 0], length 0 
21:59:46.113288 IP [CLIENT] > [SERVER] Flags [.], ack 1, win 8688, options [nop,nop,TS val 415343783 ecr 8505455], length 0
21:59:46.113554 IP [CLIENT] > [SERVER] Flags [P.], seq 1:1025, ack 1, win 8688, options [nop,nop,TS val 415343783 ecr 8505455], length 1024
21:59:46.113559 IP [SERVER] > [CLIENT]: Flags [.], ack 1025, win 2896, options [nop,nop,TS val 8505463 ecr 415343783], length 0 
21:59:46.113625 IP [CLIENT] > [SERVER] Flags [.], seq 1025:2473, ack 1, win 8688, options [nop,nop,TS val 415343783 ecr 8505455], length 1448
21:59:46.113843 IP [SERVER] > [CLIENT]: Flags [.], ack 2473, win 2896, options [nop,nop,TS val 8505463 ecr 415343783], length 0 
21:59:46.120443 IP [CLIENT] > [SERVER] Flags [.], seq 2473:3921, ack 1, win 8688, options [nop,nop,TS val 415343790 ecr 8505463], length 1448
21:59:46.120695 IP [CLIENT] > [SERVER] Flags [.], seq 3921:5369, ack 1, win 8688, options [nop,nop,TS val 415343791 ecr 8505463], length 1448
21:59:46.120710 IP [SERVER] > [CLIENT]: Flags [.], ack 5369, win 2896, options [nop,nop,TS val 8505470 ecr 415343790], length 0 
21:59:46.127322 IP [CLIENT] > [SERVER] Flags [.], seq 5369:6817, ack 1, win 8688, options [nop,nop,TS val 415343797 ecr 8505470], length 1448
21:59:46.127370 IP [CLIENT] > [SERVER] Flags [.], seq 6817:8265, ack 1, win 8688, options [nop,nop,TS val 415343797 ecr 8505470], length 1448
21:59:46.127588 IP [SERVER] > [CLIENT]: Flags [.], ack 8265, win 2896, options [nop,nop,TS val 8505477 ecr 415343797], length 0 
21:59:46.132814 IP [CLIENT] > [SERVER] Flags [.], seq 8265:9713, ack 1, win 8688, options [nop,nop,TS val 415343804 ecr 8505477], length 1448
21:59:46.132817 IP [CLIENT] > [SERVER] Flags [P.], seq 9713:10011, ack 1, win 8688, options [nop,nop,TS val 415343804 ecr 8505477], length 298
21:59:46.133006 IP [SERVER] > [CLIENT]: Flags [.], ack 10011, win 2896, options [nop,nop,TS val 8505482 ecr 415343804], length 0 
21:59:46.133170 IP [SERVER] > [CLIENT]: Flags [P.], seq 1:3, ack 10011, win 2896, options [nop,nop,TS val 8505483 ecr 415343804], length 2 
21:59:46.133177 IP [SERVER] > [CLIENT]: Flags [R.], seq 3, ack 10011, win 2896, options [nop,nop,TS val 8505483 ecr 415343804], length 0 
21:59:46.139801 IP [CLIENT] > [SERVER] Flags [.], ack 3, win 8686, options [nop,nop,TS val 415343809 ecr 8505483], length 0

The server-to-client communication (the [SERVER] > [CLIENT] lines) mostly consists of acknowledgements.

"The peer TCP must acknowledge the data, and as the ACKs arrive from the peer, only then can our TCP discard the acknowledged data from the socket send buffer. TCP must keep a copy of our data until it is acknowledged by the peer" [2].

Handshakes

If you take a look at the first three lines of that TCP dump, you'll see the 3-way handshake taking about 7 ms - that's about 20% of the overall call time. So, if you're connecting each time you talk to your server, you might want to keep connections open and pool them instead.
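By way of illustration, here is a minimal sketch of what client-side pooling might look like (the class, pool size and error handling are my own invention, not code from any project mentioned here):

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SocketChannel;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical client-side pool: pay the 3-way handshake once per channel,
// then reuse the established connections for subsequent calls.
public class ChannelPool {

    private final InetSocketAddress server;
    private final BlockingQueue<SocketChannel> idle;

    public ChannelPool(InetSocketAddress server, int size) throws IOException {
        this.server = server;
        this.idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(SocketChannel.open(server)); // the handshake happens here, once per channel
        }
    }

    public SocketChannel borrow() throws InterruptedException {
        return idle.take(); // blocks if every connection is in use
    }

    public void giveBack(SocketChannel channel) throws IOException {
        if (!channel.isConnected()) {
            channel.close();
            channel = SocketChannel.open(server); // replace a broken connection
        }
        idle.offer(channel); // the warm connection goes back for the next caller
    }
}

The point is simply that the roughly 7 ms handshake is paid once per pooled channel rather than once per call.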

There are moves afoot to exploit these handshake packets to carry application data [4].

Slow starters

This in itself may not be enough. If a connection has been idle for long enough, something called Slow-Start Restart [3] may kick in, depending on your kernel settings.

First, what is Slow-Start?

"The only way to estimate the available capacity between the client and the server is to measure it by exchanging data, and this is precisely what slow-start is designed to do. To start, the server initializes a new congestion window (cwnd) ... The cwnd variable is not advertised or exchanged between the sender and receiver... Further, a new rule is introduced: the maximum amount of data in flight (not ACKed) between the client and the server is the minimum of the rwnd and cwnd variables... [We aim] to start slow and to grow the window size as the packets are acknowledged: slow-start!" [3]

The initial size of this window on my system is given by:

[henryp@corsair Blogs]$ grep -A 2 initcwnd `find /usr/src/kernels/3.6.10-2.fc17.x86_64/include  -type f -iname '*h'`
/usr/src/kernels/3.6.10-2.fc17.x86_64/include/net/tcp.h:/* TCP initial congestion window as per draft-hkchu-tcpm-initcwnd-01 */
/usr/src/kernels/3.6.10-2.fc17.x86_64/include/net/tcp.h-#define TCP_INIT_CWND 10
/usr/src/kernels/3.6.10-2.fc17.x86_64/include/net/tcp.h-

The receive window size is exchanged in the packet headers [5].
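To get a feel for what an initial congestion window of 10 segments means, here is a deliberately idealized model of my own (it assumes cwnd doubles every round trip, no packets are lost, and the receive window never gets in the way) of how many round trips a fresh connection needs to push a payload out:

// Idealized slow-start model: cwnd starts at TCP_INIT_CWND segments and
// doubles each round trip until the whole payload has been sent.
// It ignores rwnd, losses and congestion avoidance - it is only meant to
// show why short-lived connections are dominated by round-trip time.
public class SlowStartModel {

    static int roundTripsToSend(long payloadBytes, int mss, int initCwnd) {
        long sent = 0;
        long cwndSegments = initCwnd;
        int roundTrips = 0;
        while (sent < payloadBytes) {
            sent += cwndSegments * mss;  // everything in the window goes out...
            cwndSegments *= 2;           // ...and the returning ACKs double the window
            roundTrips++;
        }
        return roundTrips;
    }

    public static void main(String[] args) {
        // 100MB (as in the upload above), an MSS of 1460 bytes (as in the tcpdump), initcwnd of 10
        System.out.println(roundTripsToSend(100L * 1024 * 1024, 1460, 10)); // prints 13
    }
}

Even in this best case, a brand-new connection needs around 13 round trips to move a 100MB payload, which is another reason to reuse connections whose window has already grown.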

Cache Connections as a Solution

Perhaps the obvious response is to cache the connections on the client side. But beware of slow-start restart "which resets the congestion window of a connection after it has been idle for a defined period of time". [3]

On a Linux box, this can be checked with:

sysctl net.ipv4.tcp_slow_start_after_idle

where an output of 1 indicates that this functionality is turned on. It is recommended that you turn it off if you want your client to cache connections [3].

Dropped Packets

Don't worry about dropped packets in TCP - they are essential. "In fact, packet loss is necessary to get the best performance from TCP! " [3]

[1] Hazelcast Community Edition.
[2] Unix Network Programming, p58, Stevens et al.
[3] High Performance Browser Networking - O'Reilly.
[4] TCP Fast Open: expediting web services - lwn.net
[5] Spy on Yourself with tcpdump




Sunday, November 24, 2013

Bufferbloat: less is more

You would have thought that increasing buffer sizes was always a good thing, right? Wrong.

You would have thought that reducing load on a system would always make it faster, right? Also wrong.

When stress testing our code in an Oracle lab in Edinburgh, we noticed that increasing the load on the system increased throughput. Independently, on totally different software (nothing in common other than being written in Java and partly running on Linux), I saw the same thing on my home network.

In both cases, the problem was a large network buffer combined with low load. At home, I saw this:

Initiated 7855 calls. Calls per second = 846. number of errors at client side = 0. Average call time = 81ms
Initiated 9399 calls. Calls per second = 772. number of errors at client side = 0. Average call time = 89ms
Initiated 10815 calls. Calls per second = 708. number of errors at client side = 0. Average call time = 96ms
.
.

and so on, until I started a second machine hitting the same single-threaded process, whereupon performance shot up:

Initiated 18913 calls. Calls per second = 771. number of errors at client side = 0. Average call time = 107ms
Initiated 21268 calls. Calls per second = 1177. number of errors at client side = 0. Average call time = 105ms
Initiated 24502 calls. Calls per second = 1617. number of errors at client side = 0. Average call time = 99ms
Initiated 29802 calls. Calls per second = 2650. number of errors at client side = 0. Average call time = 88ms
Initiated 34192 calls. Calls per second = 2195. number of errors at client side = 0. Average call time = 82ms
Initiated 39558 calls. Calls per second = 2683. number of errors at client side = 0. Average call time = 77ms

How odd - more load on the server means better throughput.

I was browsing the subject of bufferbloat on various websites, including Jim Gettys' excellent blog [1], where he writes extensively on the topic. He says:

"... bloat occurs in multiple places in an OS stack (and applications!). If your OS TCP implementation fills transmit queues more than needed, full queues will cause the RTT to increase, etc. , causing TCP to misbehave."

Inspired by this, I added to my code:

        serverSocketChannel.setOption(
            SO_RCVBUF,
            4096);

before binding the channel to an address and the problem went away (the default value for this option was about 128kb on my Linux box).
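For context, here is a minimal sketch (the class name and surrounding structure are mine, not the actual JStringServer code) of setting the option before the bind:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;
import static java.net.StandardSocketOptions.SO_RCVBUF;

public class SmallBufferServer {
    public static void main(String[] args) throws IOException {
        ServerSocketChannel serverSocketChannel = ServerSocketChannel.open();
        // Shrink the receive buffer *before* binding so that accepted connections
        // advertise a small window and packets don't linger in a bloated queue.
        serverSocketChannel.setOption(SO_RCVBUF, 4096);         // down from the ~128kb default
        serverSocketChannel.bind(new InetSocketAddress(8888));  // 8888 is the port used in these tests
        serverSocketChannel.configureBlocking(false);           // the real server is single-threaded NIO
        // ... register with a Selector and service clients as usual ...
    }
}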

Note that although this looks like a very small number, there is no fear of a buffer overrun.

"The TCP socket received buffer cannot overflow because the peer is not allowed is not allowed to send data beyond the advertised window. This is TCP's flow control" [2].

Curious to see why reducing the buffer size helps things, I tried sizes of 512, 1024, 2048 and so on up to 65536 bytes while running

sudo tcpdump -nn -i p7p1 '(tcp[13] & 0xc0 != 0)'

which according to [3] should show me when the network experiences congestion: byte 13 of the TCP header holds the flags, and the 0xc0 mask picks out the ECE and CWR bits (p7p1 is the name of my network interface, by the way).

The first value of SO_RCVBUF at which poor initial performance was encountered was 8192 bytes. Interestingly, as soon as the second client started hitting the server, tcpdump started spewing output like:

17:54:28.620932 IP 192.168.1.91.59406 > 192.168.1.94.8888: Flags [.W], seq 133960115:133961563, ack 2988954847, win 33304, options [nop,nop,TS val 620089208 ecr 15423967], length 1448
17:54:28.621036 IP 192.168.1.91.59407 > 192.168.1.94.8888: Flags [.W], seq 4115302724:4115303748, ack 2823779942, win 33304, options [nop,nop,TS val 620089208 ecr 15423967], length 1024
17:54:28.623174 IP 192.168.1.65.51628 > 192.168.1.94.8888: Flags [.W], seq 1180366676:1180367700, ack 1925192901, win 8688, options [nop,nop,TS val 425774544 ecr 15423967], length 1024
17:54:28.911140 IP 192.168.1.91.56440 > 192.168.1.94.8888: Flags [.W], seq 2890777132:2890778156, ack 4156581585, win 33304, options [nop,nop,TS val 620089211 ecr 15424257], length 1024

What can we make of this? Well, it appears that the bigger the buffer, the longer a packet can stay in the receiver's queue, as Gettys informs us [1]. The longer it stays in the queue, the longer the round trip time (RTT). The longer the RTT, the worse the sender thinks the congestion is, as it doesn't differentiate between time lost on the network and time stuck in a bloated, stupid FIFO queue (the RTT is used in determining congestion [4]).

Given a small buffer, the receiver will, at a much lower threshold, tell the sender not to transmit any more packets [2]. Thus the queue is smaller and less time is spent in it. As a result, the RTT is low and the sender believes the network to be congestion-free and is inclined to send more data.

Given a larger buffer but greater competition for it (from the second client), the available space in the buffer is reduced, so things look to the client very much as described in the previous paragraph.

It appears that the Linux community are wise to this and have taken countermeasures [5].

[1] JG's Ramblings.
[2] Unix Network Programming, Stevens et al., pp. 58, 207.
[3] Wikipedia.
[4] RFC 5681.
[5] TCP Small Queues.

To Nagle or Not

Some TCP properties Java programmers have control over; others they don't. One optimisation is Socket.setTcpNoDelay, which controls Nagle's algorithm (note: calling setTcpNoDelay with true turns the algorithm off). When the algorithm is on, the OS batches your small writes into fewer packets.
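To make the double negative concrete, a tiny sketch (the host and port are illustrative):

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SocketChannel;

public class NoDelayExample {
    public static void main(String[] args) throws IOException {
        SocketChannel channel = SocketChannel.open(new InetSocketAddress("localhost", 8888));

        channel.socket().setTcpNoDelay(true);     // "no delay" = Nagle OFF: small writes go out immediately
        // channel.socket().setTcpNoDelay(false); // the default: Nagle ON, the OS may coalesce small writes

        channel.close();
    }
}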

When you should turn it on or off depends very much on what you are trying to do [1]. Jetty sets no delay to true (that is, turns the algorithm off):

Phillips-MacBook-Air:jetty-all-8.1.9.v20130131 phenry$ grep -r setTcpNoDelay .
./org/eclipse/jetty/client/SelectConnector.java:            channel.socket().setTcpNoDelay(true);
./org/eclipse/jetty/client/SocketConnector.java:        socket.setTcpNoDelay(true);
./org/eclipse/jetty/server/AbstractConnector.java:            socket.setTcpNoDelay(true);
./org/eclipse/jetty/server/handler/ConnectHandler.java:            channel.socket().setTcpNoDelay(true);
./org/eclipse/jetty/websocket/WebSocketClient.java:        channel.socket().setTcpNoDelay(true);

Playing around with my own simple server, I experimented with this value. The server is a single thread on a 16-core Linux box using Java NIO; it services requests from 2 MacBooks, each running 100 threads that use normal, blocking IO and send 10,010 bytes of data per call (the server replies with a mere 2 bytes).
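A rough approximation of a single client call might look like this (the real harness, its thread pool and its timing code are not shown; the host, port and class name are assumptions):

import java.io.DataInputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.net.Socket;

public class BlockingCall {
    public static void main(String[] args) throws IOException {
        byte[] payload = new byte[10_010];
        byte[] reply = new byte[2];

        try (Socket socket = new Socket("192.168.1.94", 8888)) {
            socket.setTcpNoDelay(false); // explicitly leave Nagle ON - the faster setting in the results below
            OutputStream out = socket.getOutputStream();
            out.write(payload);          // 10,010 bytes up...
            out.flush();
            new DataInputStream(socket.getInputStream()).readFully(reply); // ...2 bytes ("OK") back
        }
    }
}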

Setting the algorithm on or off on the server made no discernible difference. Not surprising as 2 bytes are (probably) going to travel in the same packet. But calling socket.setTcpNoDelay(false) on the clients showed a marked improvement. Using a MacBook Pro (2.66GHz, Intel Core 2 Duo) as the client, the results looked like:

socket.setTcpNoDelay(true)

Mean calls/second:      3884
Standard Deviation:     283
Average call time (ms): 29

socket.setTcpNoDelay(false)

Mean calls/second:      5060
Standard Deviation:     75
Average call time (ms): 20

Your mileage may vary.

The big difference was the time it took to call SocketChannel.connect(...). This dropped from 20 to 13 ms.

As an aside, you can see Linux's network buffers filling up with something like:

[henryp@corsair ~]$ cat /proc/net/tcp | grep -i 22b8 # where 22b8 is port 8888 on which I am listening
   2: 5E01A8C0:22B8 00000000:0000 0A 00000000:00000012 02:0000000D 00000000  1000        0 1786995 2 ffff880f999f4600 99 0 0 10 -1                   
   3: 5E01A8C0:22B8 4101A8C0:F475 03 00000000:00000000 01:00000062 00000000  1000        0 0 2 ffff880f8b4fce80                                      
   4: 5E01A8C0:22B8 4101A8C0:F476 03 00000000:00000000 01:00000062 00000000  1000        0 0 2 ffff880f8b4fcf00 
.
.
  76: 5E01A8C0:22B8 5B01A8C0:D3A5 01 00000000:0000171A 00:00000000 00000000  1000        0 4035262 1 ffff880fc3ff8700 20 3 12 10 -1                  
  77: 5E01A8C0:22B8 5B01A8C0:D3AD 01 00000000:0000271A 00:00000000 00000000     0        0 0 1 ffff880fc3ffb100 20 3 12 10 -1                        
  78: 5E01A8C0:22B8 5B01A8C0:D3B4 01 00000000:00000800 00:00000000 00000000     0        0 0 1 ffff880e1216bf00 20 3 12 10 -1                        
  79: 5E01A8C0:22B8 5B01A8C0:D3A8 01 00000000:0000271A 00:00000000 00000000     0        0 0 1 ffff880fc3ff9500 20 3 12 10 -1                        
  80: 5E01A8C0:22B8 5B01A8C0:D3AC 01 00000000:0000271A 00:00000000 00000000     0        0 0 1 ffff880fc3ffe200 20 3 12 10 -1                        
  81: 5E01A8C0:22B8 5B01A8C0:D3B3 01 00000000:0000271A 00:00000000 00000000     0        0 0 1 ffff880e12169c00 20 3 12 10 -1                        
  82: 5E01A8C0:22B8 4101A8C0:F118 01 00000000:00000000 00:00000000 00000000  1000        0 4033066 1 ffff880e1216d400 20 3 0 10 -1 

Note 271A is 10010 - the size of our payload.
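The columns are hexadecimal, so decoding them is just a base-16 conversion; for example:

// local_address 5E01A8C0:22B8 is 192.168.1.94:8888 (on x86 the address bytes
// appear in little-endian order) and the queue column is tx_queue:rx_queue.
public class ProcNetTcp {
    public static void main(String[] args) {
        System.out.println(Integer.parseInt("22B8", 16)); // 8888  - the port we are listening on
        System.out.println(Integer.parseInt("271A", 16)); // 10010 - bytes queued: the size of our payload
    }
}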

[1] ExtraHop blog.