Some notes from a colleague who knows HBase better than I do.
Quick overview of how HBase stores data
When HBase flushes a batch of in-memory data to disk, it writes a new file rather than updating an existing one, so the data may end up spread across separate files. Each file is sorted by key.
So, say HBase persists data with keys A, F and B. It sorts them so the file looks like:
[A, B, F]
Some time later, HBase persists G, P and M. Now there are two files that look like:
[A, B, F], [G, M, P]
At some later time still, data with keys C, R and D arrives. Key R clearly comes after the last key already written to the files (P), but C and D fall in the middle of the first file. So, what does HBase do? It creates a new file with [C, D, R] in it. To summarize, the files now look like this:
[A, B, F], [G, M, P], [C, D, R]
Even later, a client wants to find the value for key D. Assuming it's not cached in memory, where does HBase look for it? Well, although all data within a file is ordered (so we can call off the search early if it's not where we would expect it in the file), we don't know which file D is in. HBase must search all files even if it doesn't search through all of a file.
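The lookup problem can be sketched with a toy model in Java. This is not HBase code: the class and method names (StoreFileLookup, flush, filesProbedToFind) are made up for illustration, and each "file" is just a sorted array of keys. It only shows that sorting helps within a file while the number of files still drives the cost of a read.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only (hypothetical names): each "file" is a sorted array of keys.
class StoreFileLookup {

    // Mimics a flush: the batch of keys is sorted before it becomes a file.
    static String[] flush(String... keys) {
        String[] file = keys.clone();
        Arrays.sort(file);
        return file;
    }

    // Returns how many files were probed before the key was found,
    // or -1 if the key is in none of them.
    static int filesProbedToFind(List<String[]> files, String key) {
        int probed = 0;
        for (String[] file : files) {
            probed++;
            // The file is sorted, so a binary search inside it is enough,
            // but we still have to open each file in turn.
            if (Arrays.binarySearch(file, key) >= 0) {
                return probed;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String[]> files = Arrays.asList(
                flush("A", "F", "B"),
                flush("G", "P", "M"),
                flush("C", "R", "D"));
        // D lives in the last file written, so all three files are probed.
        System.out.println(filesProbedToFind(files, "D")); // prints 3
    }
}
```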
To help matters, HBase can compact these three files into one that looks like:
[A, B, C, D, F, G, M, P, R]
Well, that makes searching much easier.
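Conceptually, the compaction step is a merge of several sorted files into a single sorted file. The sketch below is a simplified illustration in Java, assuming plain string keys and a hypothetical class name (CompactionSketch). Real HBase compaction also has to deal with cell versions and deletes, which are ignored here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Toy model only (hypothetical names): a k-way merge of sorted "files".
class CompactionSketch {

    // Merges already-sorted files into one sorted list of keys.
    static List<String> compact(List<String[]> files) {
        // Each queue entry is {fileIndex, positionInFile}; entries are
        // ordered by the key they currently point at.
        PriorityQueue<int[]> heads = new PriorityQueue<>(
                (a, b) -> files.get(a[0])[a[1]].compareTo(files.get(b[0])[b[1]]));
        for (int i = 0; i < files.size(); i++) {
            if (files.get(i).length > 0) {
                heads.add(new int[] {i, 0});
            }
        }

        List<String> merged = new ArrayList<>();
        while (!heads.isEmpty()) {
            int[] head = heads.poll();
            String[] file = files.get(head[0]);
            merged.add(file[head[1]]);
            // Advance within the same file if it has more keys left.
            if (head[1] + 1 < file.length) {
                heads.add(new int[] {head[0], head[1] + 1});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String[]> files = Arrays.asList(
                new String[] {"A", "B", "F"},
                new String[] {"G", "M", "P"},
                new String[] {"C", "D", "R"});
        System.out.println(compact(files)); // prints [A, B, C, D, F, G, M, P, R]
    }
}
```

After the merge, a lookup for D needs to probe only one sorted file instead of three.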
How can we leverage this? Well, in certain read-write patterns, we could force a compaction at an appropriate point. For instance, I'm using HBase to store a dictionary I build when processing a corpus of text. Later, I refer to that dictionary but I never update it or add any more data. So, after I have finished my last write, I can request a major compaction of the table, for example with major_compact in the HBase shell or Admin.majorCompact in the Java client.
After compacting the table, my read throughput went up by a factor of 4.
For more information on HBase compaction see here.