Oracle's Coherence has a very nice reporting tool. If you don't have it running by default, you can connect via JConsole and turn it on. Give it a directory and it will print cluster statistics to it every minute.
The reason we need these stats is that on a quiet system, the percentage of packets dropped on some nodes can be terrible. I've been looking at why.
Oracle Coherence nodes talk to each other via UDP. Unlike TCP, nothing is ensuring our packets reach their destination. To counter this, Coherence builds its own protocol on top of UDP that resends packets if they have not been acknowledged or if the sender receives a NACK.
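To make the resend idea concrete, here is a toy simulation of acknowledge-or-resend over a lossy channel. It is only a sketch of the general technique, not Coherence's actual protocol: the function name, the loss model and the attempt limit are all made up for illustration, and NACKs and lost ACKs are not modelled.

```python
import random

def send_reliably(packets, loss_rate, max_attempts=50, seed=0):
    """Simulate ACK-based resending over a channel that silently drops datagrams.

    Hypothetical helper for illustration only. Returns (delivered, resends).
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    delivered = []
    resends = 0
    for packet in packets:
        attempts = 0
        while True:
            attempts += 1
            # The datagram arrives unless the channel drops it.
            if rng.random() >= loss_rate:
                delivered.append(packet)  # receiver gets it and ACKs
                break
            if attempts == max_attempts:
                raise RuntimeError("gave up on packet %r" % (packet,))
            # No ACK arrived before the timeout, so the sender sends again.
            resends += 1
    return delivered, resends

if __name__ == "__main__":
    delivered, resends = send_reliably(list(range(1000)), loss_rate=0.05)
    print("delivered", len(delivered), "packets with", resends, "resends")
```

Every packet gets through in the end; the cost of the lossy channel shows up only as a resend count, which is the kind of figure the Coherence reports expose.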
On a decent network, why would our packets not reach their destination? Well, one reason is that the receive buffer is full. We've set our buffers to 2MB (per Oracle's advice) so this is unlikely but possible. If an application receives UDP packets faster than it can process them, packets will be dropped.
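Asking for a 2MB buffer and getting one are not the same thing, since the OS may cap the request. A minimal sketch for checking what a UDP socket is actually granted, using only the standard socket options (the function name is mine, and this has nothing to do with how Coherence itself configures its sockets):

```python
import socket

REQUESTED_BYTES = 2 * 1024 * 1024  # the 2MB figure mentioned above

def granted_receive_buffer(requested=REQUESTED_BYTES):
    """Ask for a UDP receive buffer of `requested` bytes and return what the OS reports."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    finally:
        sock.close()

if __name__ == "__main__":
    print("requested", REQUESTED_BYTES, "bytes, OS reports", granted_receive_buffer())
```

On Linux the reported value is double the amount actually set (the kernel reserves room for bookkeeping), and the request is silently capped at net.core.rmem_max, so a reported figure well under 4MB suggests the 2MB request was not honoured in full.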
However, if you're running on Linux, you can easily see if this is the case.
henryp@phillsdell:~$ netstat -su
.
.
Udp:
3196 packets received
9 packets to unknown port received.
0 packet receive errors
2770 packets sent
It's the packet receive errors count that is of interest to us. Although the packet is discarded, Linux will at least make a note of its loss.
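If you want to watch this counter over time rather than eyeball it, the output is easy to parse. A rough sketch, assuming the `netstat -su` layout shown above (the helper name is mine, and other netstat versions may format the section differently):

```python
import re

def udp_counters(netstat_output):
    """Return {label: count} for the 'Udp:' section of `netstat -su` style output."""
    counters = {}
    in_udp = False
    for line in netstat_output.splitlines():
        stripped = line.strip()
        # Section headers look like 'Udp:' or 'Tcp:', a single word ending in a colon.
        if stripped.endswith(":") and " " not in stripped:
            in_udp = (stripped == "Udp:")
            continue
        if in_udp:
            # Counter lines look like '<number> <label>', sometimes with a trailing '.'.
            match = re.match(r"(\d+)\s+(.*?)\.?$", stripped)
            if match:
                counters[match.group(2)] = int(match.group(1))
    return counters

SAMPLE = """Udp:
3196 packets received
9 packets to unknown port received.
0 packet receive errors
2770 packets sent
"""

if __name__ == "__main__":
    print(udp_counters(SAMPLE)["packet receive errors"])
```

Feed it the captured output of the real command at intervals and compare the "packet receive errors" value between samples.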
The mystery continues: we are seeing nowhere near enough packet loss here to explain the number of Coherence resends.
Our cluster is saying that the number of resends ("when there is no ACK received within a timeout period" [1]) and the number resent early ("a packet is resent ahead of schedule when there is a NACK indicating that the packet has not been received" [1]) are both high.
These resends appear to be necessary, as the excess figure ("the total number of packet retransmissions which were later proven unnecessary" [1]) is low.
Note that you would not necessarily expect the publisher success rate to correlate with the receiver success rate, because they measure different things. "Publisher success rate is a ratio of the number of packets successfully delivered in a first attempt to the total number of sent packets" [1], whereas the receiver rate measures unnecessary redelivery: "Failure count is incremented when a re-delivery of previously received packet is detected. It could be caused by either very high inbound network latency or lost ACK packets." [2]
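To see why the two rates need not move together, it helps to write them down. The formulas below are my reading of the quoted definitions (one minus resent over sent, and one minus repeated over received), not something taken from the Coherence source, and the function and parameter names are hypothetical:

```python
def publisher_success_rate(packets_sent, packets_resent):
    """Fraction of sent packets that did not need resending (assumed formula)."""
    if packets_sent == 0:
        return 1.0
    return 1.0 - packets_resent / packets_sent

def receiver_success_rate(packets_received, packets_repeated):
    """Fraction of received packets that were not duplicates (assumed formula)."""
    if packets_received == 0:
        return 1.0
    return 1.0 - packets_repeated / packets_received

if __name__ == "__main__":
    # Made-up counts: many resends by the publisher, almost no duplicates at the receiver.
    print(publisher_success_rate(packets_sent=1000, packets_resent=250))      # 0.75
    print(receiver_success_rate(packets_received=1000, packets_repeated=10))  # 0.99
```

With these invented numbers the publisher looks bad while the receiver looks healthy. That is the pattern you would expect when the original packets really are going missing: a resend of a packet that never arrived is not a duplicate from the receiver's point of view.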
Our network guys assure us that our Cisco kit is not reporting anywhere near enough lost UDP packets to explain Coherence's need to resend, and neither is the OS when it reports lost packets on the NICs.
What's more, putting a load through the system does not change the absolute number of resends, so as a percentage of overall packets this pathological metric drops. The same happens when we put a lot of load on the network with the datagram tester that comes with Coherence.
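The arithmetic behind that observation is simple enough to spell out. The figures here are invented purely for illustration and are not taken from our cluster:

```python
def resend_percentage(resends, total_packets):
    """Resends as a percentage of all packets sent."""
    return 100.0 * resends / total_packets

if __name__ == "__main__":
    FIXED_RESENDS = 500  # hypothetical: stays the same whatever the load
    for total in (1000, 10000, 100000):
        print(total, "packets:", resend_percentage(FIXED_RESENDS, total), "% resent")
```

A fixed count of resends goes from half of all traffic on a quiet system to a rounding error on a busy one, which is why the percentage looks worst when the cluster is doing the least.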
We are perplexed.
[1] Coherence docs.
[2] Coherence docs.