With memory running low, a Java process may grind almost to a halt, often because the garbage collector is thrashing.
A handy utility
Before our JVMs died with OutOfMemoryErrors, I was watching the GC activity. I used jstat to take a reading every 10s (the 10000 ms argument below) and got:
:~> jstat -gc 24519 10000
S0C S1C S0U S1U EC EU OC OU PC PU YGC YGCT FGC FGCT GCT
7104.0 7104.0 0.0 0.0 131392.0 131392.0 1165120.0 1165119.9 35712.0 35327.1 12978 170.599 9002 30293.627 30464.226
7104.0 7104.0 0.0 0.0 131392.0 131392.0 1165120.0 1165119.9 35712.0 35330.2 12978 170.599 9004 30300.773 30471.371
7104.0 7104.0 0.0 0.0 131392.0 131392.0 1165120.0 1165120.0 35712.0 35329.4 12978 170.599 9007 30312.233 30482.832
.
.
[24519 was the PID of the process]
FGCT (Full Garbage Collection Time [1]) was the interesting metric. By this point, full garbage collection was taking roughly 10s out of every 10s interval, and the total garbage collection time (GCT) was almost entirely due to it rather than to the young generation.
[Compare this to a more healthy JVM that had been running for a day:
~> jstat -gc 22226 10000
S0C S1C S0U S1U EC EU OC OU PC PU YGC YGCT FGC FGCT GCT
1216.0 1216.0 990.9 0.0 143168.0 27467.5 1165120.0 1088493.7 54016.0 35576.7 9227 89.841 10 15.148 104.989
1216.0 1216.0 0.0 992.0 143168.0 46244.1 1165120.0 1088829.1 54016.0 35576.7 9228 89.850 10 15.148 104.997
1216.0 1152.0 974.8 0.0 143232.0 76039.4 1165120.0 1089091.5 54016.0 35576.7 9229 89.856 10 15.148 105.004
where only about 15 seconds of the last 24 hours had been spent in full garbage collection.]
Full garbage collection is a stop-the-world event [2, 3, 4], which explains why the application's responsiveness became so poor. It may be worth keeping an eye on this figure.
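One rough way of doing that is to pipe jstat's output through a small watcher. Here's a minimal sketch, assuming the 15-column -gc layout shown above (the class name and the 5-second threshold are my own inventions, not part of jstat):

import java.io.BufferedReader;
import java.io.InputStreamReader;

// Usage: jstat -gc <pid> 10000 | java FullGcWatch
public class FullGcWatch {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        double lastFgct = -1;
        String line;
        while ((line = in.readLine()) != null) {
            String[] cols = line.trim().split("\\s+");
            if (cols.length < 15 || !Character.isDigit(cols[0].charAt(0))) {
                continue; // skip the header row
            }
            double fgct = Double.parseDouble(cols[13]); // FGCT: cumulative full GC time (s)
            if (lastFgct >= 0 && fgct - lastFgct > 5.0) { // arbitrary threshold
                System.out.printf("Full GC is dominating: +%.1fs this interval%n", fgct - lastFgct);
            }
            lastFgct = fgct;
        }
    }
}

Against the unhealthy JVM above this would fire on almost every sample; against the healthy one it would stay silent.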
Parallel Scavenge GC
The default garbage collector on our box was the Parallel Scavenger. I've talked about the Young Generation's behaviour with this GC here. But the Old Generation GC has a different strategy, called PS MarkSweep. This uses the well-known mark-and-sweep algorithm plus a compaction pass to stop the memory space from looking like Swiss cheese [5]. This can be expensive.
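You can confirm which collectors a JVM is using from inside the process via the standard GarbageCollectorMXBean API; with the Parallel Scavenge collector the beans are named "PS Scavenge" (young generation) and "PS MarkSweep" (old generation). A quick sketch:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Prints each collector's name with its cumulative collection count and time.
public class WhichCollectors {
    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": " + gc.getCollectionCount()
                    + " collections, " + gc.getCollectionTime() + " ms");
        }
    }
}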
What actually triggers an OOME?
What triggers an OOME is not necessarily that there is too little memory to instantiate an object. Among the other causes are "98%+ of time being spent in GC" and "less than 2% of the heap being freed in a collection" [2]. This is not very deterministic.
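Those figures correspond to HotSpot's GC overhead limit, controlled by the -XX:GCTimeLimit and -XX:GCHeapFreeLimit flags (defaulting to 98 and 2). You can provoke this flavour of OOME with a deliberately pathological sketch like the one below: everything stays reachable, so each full GC frees almost nothing and the JVM spends nearly all its time collecting.

import java.util.ArrayList;
import java.util.List;

// Run with a small heap, e.g.: java -Xmx64m GcOverheadDemo
public class GcOverheadDemo {
    public static void main(String[] args) {
        List<long[]> hoard = new ArrayList<>();
        try {
            while (true) {
                hoard.add(new long[1024]); // ~8KB per iteration, never released
            }
        } catch (OutOfMemoryError e) {
            // With the parallel collector this is often
            // "GC overhead limit exceeded" rather than "Java heap space".
            System.err.println("Died with: " + e.getMessage());
        }
    }
}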
[1] http://docs.oracle.com/javase/1.5.0/docs/tooldocs/share/jstat.html
[2] QCon lecture
[3] http://www.fasterj.com/articles/oraclecollectors1.shtml
[4] http://apmblog.compuware.com/2011/03/10/major-gcs-separating-myth-from-reality/
[5] Dr Richard Warburton's blog