We've been given new hardware on which to run our application. To make sure it will be OK, we ran our performance tests against it. To our surprise, they sometimes did poorly despite the new CPUs being much better. Thread dumps taken with jstack showed that, at any given moment, nothing much was going on.
First we tried tuning OS parameters, with no luck. Then we looked at more esoteric things like traffic shaping but found nothing.
So, time to crack open Wireshark. We took a 60s sample of network traffic when things were going well as a baseline, then sampled for another 60s during a slow period.
Although the application's functionality was fine during both samples, the TCP/IP traffic showed big differences. The slow traffic had lots of re-transmissions and "TCP ACKed lost segment" messages in the log.
To filter them in Wireshark, use:
tcp.analysis.lost_segment || tcp.analysis.retransmission
(Incidentally, another interesting filter is:
frame.time_delta > 0.1
which shows packets that arrived more than 0.1s after the previous captured packet, which is ages in the traffic world. Another is
tcp.window_size == 0
which shows a receiver advertising a zero window, that is, its buffer has filled up.)
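If you'd rather count matches than eyeball them, the same display filters can be fed to tshark (the command-line tool that ships with Wireshark). What follows is only a rough sketch of how that might look, not something we ran for this post: the dictionary keys, the function names and the capture file name in the usage note are all made up for illustration, and it assumes tshark is on the PATH.

```python
import subprocess

# Display filters taken from the text above; the dictionary keys are
# illustrative labels, not anything Wireshark defines.
FILTERS = {
    "loss": "tcp.analysis.lost_segment || tcp.analysis.retransmission",
    "gaps": "frame.time_delta > 0.1",
    "zero_window": "tcp.window_size == 0",
}


def tshark_command(capture_path, display_filter):
    """Build a tshark command that prints one frame number per matching frame."""
    return [
        "tshark",
        "-r", capture_path,      # read from a saved capture file
        "-Y", display_filter,    # apply a display filter
        "-T", "fields",          # print selected fields only
        "-e", "frame.number",    # one frame number per output line
    ]


def count_matching_frames(capture_path, display_filter):
    """Run tshark and count the frames that match the display filter."""
    result = subprocess.run(
        tshark_command(capture_path, display_filter),
        capture_output=True,
        text=True,
        check=True,
    )
    return sum(1 for line in result.stdout.splitlines() if line.strip())
```

With a hypothetical capture saved as slow.pcap, count_matching_frames("slow.pcap", FILTERS["loss"]) would give the number of re-transmitted or lost-segment frames in that file.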
In our 60s capture on the client side, there were some 2 million frames captured. Of these, some 83 000 were re-transmissions or lost segments. This is about 4% of our traffic, which seems enough to slow our connections noticeably.
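For the record, the arithmetic behind that percentage is trivial, but here it is as a small sketch. The function name is ours, and the inputs are the rounded figures quoted above rather than exact frame counts.

```python
def problem_frame_rate(problem_frames, total_frames):
    """Return the fraction of captured frames that were flagged as problems."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    return problem_frames / total_frames


# Rounded figures from the 60s client-side capture described above.
rate = problem_frame_rate(83_000, 2_000_000)
print(f"{rate:.1%} of frames were re-transmissions or lost segments")
```

Running the same calculation against the baseline capture would show how much of this is normal background noise, which is the comparison that really matters.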
The hardware guys have assured us everything is fine. But we do share a VLAN (that is, we share physical networks with other teams) and its capacity may be exhausted. The switch in question is a Cisco Nexus 5k.
The mystery continues.