Our network woes continue.
Coherence exposes an MBean (Coherence:type=Node,nodeId=XXX) that measures intra-cluster packet loss over its own proprietary protocol, TCMP. In our old environment, this showed a success rate of over 99.9%. In our new one, it was touching about 50%.
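If you'd rather pull these figures over JMX than click around JConsole, something like the following rough sketch works. The service URL, port, and nodeId are placeholders for your own environment, and the PublisherSuccessRate/ReceiverSuccessRate attribute names are what I'd expect on the Node MBean, so check them against your Coherence version:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class PacketLossCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder address: any cluster member started with remote JMX enabled.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://somehost:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // nodeId=1 is a placeholder; pick the member you want to inspect.
            ObjectName node = new ObjectName("Coherence:type=Node,nodeId=1");
            // Proportion of TCMP packets sent/received without retransmission.
            Object publisherRate = mbs.getAttribute(node, "PublisherSuccessRate");
            Object receiverRate  = mbs.getAttribute(node, "ReceiverSuccessRate");
            System.out.println("Publisher success rate: " + publisherRate);
            System.out.println("Receiver success rate:  " + receiverRate);
        }
    }
}
```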
Coherence also ships a standalone Datagram Test utility that measures raw packet loss between two boxes, independent of the cluster. Running it in our new environment showed no problems. So it seemed the problem wasn't the network.
The only way I could reproduce the problem with the Datagram Test was to set its buffers to unrealistically small sizes. But we were using Coherence's defaults. So what was going on?
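For the record, the test is driven from the command line. The invocation below is illustrative only: the hosts, ports, and buffer values are made up, and the flag names should be checked against your Coherence version's documentation:

```
# On box1 - listen with a deliberately tiny receive buffer (in packets)
# to provoke the overflow; -polite makes it wait for the other side.
java -cp coherence.jar com.tangosol.net.DatagramTest -local box1:9999 \
     -rxBufferSize 4 -polite

# On box2 - publish to box1:
java -cp coherence.jar com.tangosol.net.DatagramTest -local box2:9999 box1:9999
```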
Well, thinking about it, what could cause those buffers to overflow? One answer: nothing is emptying them. And that could be down to the garbage collector.
"Large inbound buffers can help insulate the Coherence network layer from JVM pauses that are caused by the Java Garbage Collector. While the JVM is paused, Coherence cannot dequeue packets from any inbound socket. If the pause is long enough to cause the packet buffer to overflow, the packet reception is delayed as the originating node must detect the packet loss and retransmit the packet(s)." [1]
Indeed, our garbage collection times are much higher on the new, improved hardware. How odd. (This is the subject of another post.)
Rather than watching these packet loss statistics live in JConsole, you can also have Coherence write a report every so often. You do this via the MBean with the object name Coherence:type=Reporter,nodeId=YYY, where YYY is your node of choice; a sketch of starting it over JMX follows. Configuration is fairly self-explanatory, and it gives you the chance to plot performance over time.
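In the same vein as the earlier snippet, here's roughly how you'd kick the Reporter off programmatically. Again, the address and nodeId are placeholders, and the IntervalSeconds attribute and start operation are what I'd expect from the Reporter MBean, so verify against your version:

```java
import javax.management.Attribute;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class StartReporter {
    public static void main(String[] args) throws Exception {
        // Placeholder address and nodeId, as before.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://somehost:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName reporter = new ObjectName("Coherence:type=Reporter,nodeId=1");
            // How often, in seconds, the reporter writes its output files.
            mbs.setAttribute(reporter, new Attribute("IntervalSeconds", 60L));
            // Start writing the configured reports.
            mbs.invoke(reporter, "start", null, null);
        }
    }
}
```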
[1] Coherence Documentation.