Thursday, July 11, 2013

The Thrashing of the Garbage Collector

With memory running low, a Java process may grind almost to a halt. This can be due to thrashing in the garbage collector.

A handy utility

Before our JVMs died with OutOfMemoryErrors, I was looking at the GC activity. I used jstat to take a reading every 10s (the 10000 (ms) argument below) and got:

:~> jstat -gc 24519 10000 
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       PC     PU    YGC     YGCT    FGC    FGCT     GCT 
7104.0 7104.0  0.0    0.0   131392.0 131392.0 1165120.0  1165119.9  35712.0 35327.1  12978  170.599 9002  30293.627 30464.226 
7104.0 7104.0  0.0    0.0   131392.0 131392.0 1165120.0  1165119.9  35712.0 35330.2  12978  170.599 9004  30300.773 30471.371 
7104.0 7104.0  0.0    0.0   131392.0 131392.0 1165120.0  1165120.0  35712.0 35329.4  12978  170.599 9007  30312.233 30482.832 
.
.

[24519 was the PID of the process]

FGCT (Full Garbage Collection Time [1]) was the interesting metric. At this point of time, full garbage collection was taking about 10s every 10s or so and the total garbage collection time (GCT) was almost entirely due to it (not the young generation).

[Compare this to a more healthy JVM that had been running for a day:

~> jstat -gc 22226 10000 
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       PC     PU    YGC     YGCT    FGC    FGCT     GCT 
1216.0 1216.0 990.9   0.0   143168.0 27467.5  1165120.0  1088493.7  54016.0 35576.7   9227   89.841  10     15.148  104.989 
1216.0 1216.0  0.0   992.0  143168.0 46244.1  1165120.0  1088829.1  54016.0 35576.7   9228   89.850  10     15.148  104.997 
1216.0 1152.0 974.8   0.0   143232.0 76039.4  1165120.0  1089091.5  54016.0 35576.7   9229   89.856  10     15.148  105.004 

where only 15 seconds out of the last 24 hours has been due to a full garbage collection.]

Full garbage collection is a stop-the-world event [2, 3, 4] so this explains why the application's responsiveness becomes so poor. It may be worth keeping an eye on this figure.

Parallel Scavenge GC

The default garbage collector on our box was the Parallel Scavenger. I've talked about the Young Generations behaviour with this GC here. But the Old Generation GC has a different strategy. It's called PS MarkSweep. This uses the well known mark-and-sweep algorithm but also a compaction algorithm to stop the memory space looking like Swiss cheese [5]. This can be expensive.

What actually triggers an OOME?

What triggers an OOME is not necessarily there not being enough memory to instantiate an object. Amongst other reasons are "98%+ time is spent in GC" and "less than 2% heap is freed in a collection" [2]. This is not very deterministic.


[1] http://docs.oracle.com/javase/1.5.0/docs/tooldocs/share/jstat.html
[2] QCon lecture
[3] http://www.fasterj.com/articles/oraclecollectors1.shtml
[4] http://apmblog.compuware.com/2011/03/10/major-gcs-separating-myth-from-reality/
[5] Dr Richard Warburton's blog 

No comments:

Post a Comment