Friday, September 2, 2016

Miscellaneous HBase tips

I'm new(ish) to HBase so here are a few notes I've jotted down over the last few months. Basically, Spark is great at what it does, but if you need to look something up while processing an element, you're best relying on another tool. HBase has been our choice.

HBase is not like a normal database

You can't do a simple select count(*)... You need to do something like:

$ hbase org.apache.hadoop.hbase.mapreduce.RowCounter

See here for more information.

The HBase shell is not even SQL like. If you want to limit the number of rows returned, the syntax looks like:

hbase> scan 'test-table', {'LIMIT' => 5}

In "Hadoop: the Definitive Guide" there is a great "Whirlwind Tour of the Data Model". What follows is an even more condensed precis.
"Table cells - the intersection of rows and column coordinates - are versioned. By default, their version is a timestamp... A cell's contents is an uninterpreted array of bytes.
"Table rows are sorted by ... the table's primary key... All table accesses are via the primary key
"Row columns are grouped into column families. All column family members have a common prefix.
"Physically, all column family members are stored together on the filesystem [therefore] it is advised that all column family members have the same general access patterns and size characteristics.
"Tables are automatically partitioned horizontally by HBase into regions. Initially, a table comprises of a single region but as the size of the region grows, it splits...Regions are the units that get distributed over an HBase cluster.
"Row updates are atomic no matter how many row columns constitute the row-level transaction." 
Unsurprisingly for a solution based on Hadoop, HBase shares a similar architecture. The "HBase master node orchestrat[es] a cluster of one or more regionserver slaves." Unlike Hadoop, "HBase depends on ZooKeeper ... as the authority on cluster state".

"At a minimum, when bootstrapping a client connection to an HBase cluster, the client must be passed the location of the ZooKeeper ensemble. Thereafter, the client navigates the ZooKeeper hierarchy to learn cluster attributes such as server locations." For the purposes of creating a Spark artifact, this is simply a matter of adding the host machine's hbase-site.xml at the top level.


Find the load the cluster is under with:

echo "status 'detailed'" | hbase shell 2>/dev/null | grep requestsPerSecond | perl -pe s/,.*//g

With this, I've seen nodes in my cluster happily reach 60 000 requests per second each, which is most pleasing.

However, the load over the nodes is not terribly evenly distributed. One way to tackle this problem is salting. I did actually preempt this problem by reversing the Strings that were my keys. Since each key has a prefix taken from a fairly small set, I was expecting them to form "lumps" of data. However, I then create the HBase table with something like:

hbaseAdmin.createTable(hTableDescriptor, toBytes('0'), toBytes('z'), 20)

(where 20 is my number of regions).

However, this assumes that the text (even when reversed) that I am using as keys is evenly distributed over the alphanumerics (it's not as it's English words rather than random text). So, I still have some lumpiness.

Another optimization is to define the regions at table creation time (see the HBaseAdmin.createTable method that takes a startKey and endKey). This is to mitigate the problem that when a table is created, there is only one region. "The reason HBase creates only one region for the table is that it cannot possibly know how to create the split points within the row key space. Making such decisions is based highly on the distribution of the keys in your data." (from here). "Since pre-splitting will ensure that the initial load is more evenly distributed throughout the cluster, you should always consider using it if you know your key distribution beforehand."

Also, storing data in the L1 cache can speed things up hugely. This can be set on using this method. There are caveats (L1 is generally used for metadata like Bloom Filters where "false positive matches are possible, but false negatives are not" etc). But when I implemented this my tables, throughput using a get call went up by an order of magnitude. Before implementing it, the symptoms were that the CPU usage was very low (a load average of about 1 to 2 on a 24-core machine), the client threads waiting on the get method and not much activity in HBase itself.

Finally, if you're reading, consider how you're batching. The get method is overloaded to take a number of keys. But batch size matters. By increasing the batch size from 10s to 100s, I achieved a 5 times increase in performance (to about 130 000 gets per second per node). Too high a batch size, however, and it blows up. Quite why is something I am investigating.

No comments:

Post a Comment