Thursday, November 29, 2018

Hive


I'm building a Hadoop ecosystem test harness that will allow a whole slew of applications to be tested in a single JVM, in a single integration test (source).

A test for this harness fired up ZooKeeper, Kafka, HDFS, Spark and finally Hive. It feeds Kafka with messages, Spark Structured Streaming processes them and writes them as Parquet to HDFS, and finally Hive reads this file and checks that what it reads is what Spark wrote.

This all worked fine until I decided to partition the Spark DataStreamWriter. Then, Hive didn't see any data and the test failed on the assertion that what Spark sees, Hive sees. Annoyingly, the failure was not due to a bug but to what is essentially misconfiguration. The reason is that Hive keeps a metastore of metadata about all the things it can see, and although the data is there to be read, Hive won't read it until the metastore is updated.
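To make the failure mode concrete, here's a toy model of that behaviour: queries only see partitions that have been explicitly registered, regardless of what's sitting on disk. This is just an illustration in plain Python, not Hive's actual implementation; the names and the row count are made up to mirror this test.

```python
# Toy model of a Hive-style metastore: queries only see data in
# partitions that have been registered, regardless of what files
# exist on disk. An illustration only, not Hive code.

class ToyMetastore:
    def __init__(self):
        self.registered_partitions = set()

    def add_partition(self, spec):
        self.registered_partitions.add(spec)

    def visible_rows(self, files_on_disk):
        # files_on_disk maps partition spec -> row count in that file.
        # Only registered partitions contribute to the result.
        return sum(rows for spec, rows in files_on_disk.items()
                   if spec in self.registered_partitions)

# Spark has written a partition's worth of Parquet to HDFS...
on_disk = {"partitionKey=2018-11-28": 2879}

metastore = ToyMetastore()
print(metastore.visible_rows(on_disk))   # 0: the data exists but is invisible

metastore.add_partition("partitionKey=2018-11-28")
print(metastore.visible_rows(on_disk))   # 2879: now the count matches
```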

So, after putting a long Thread.sleep in the test code to keep everything running, I fired up Beeline with something like:

beeline> !connect jdbc:hive2://localhost:40327/default

whereupon we're prompted for a user name and password. First, we tell Hive about the data by creating an external table over it:

0: jdbc:hive2://localhost:40327/default> create external table parquet_table_name (key String, value String) PARTITIONED BY (partitionKey Date)  STORED AS PARQUET LOCATION 'hdfs://127.0.0.1:33799/tmp_parquet/';
No rows affected (0.077 seconds)

but we can't see any rows:

0: jdbc:hive2://localhost:40327/default> select count(*) from parquet_table_name;
No rows selected (0.204 seconds)
+------+
| _c0  |
+------+
| 0    |
+------+

This seems to be because there are no partitions:

0: jdbc:hive2://localhost:40327/default> show partitions parquet_table_name;
+------------+
| partition  |
+------------+
+------------+
No rows selected (0.219 seconds)

The solution was to add the partitions manually:

1: jdbc:hive2://localhost:40327/default> ALTER TABLE parquet_table_name ADD PARTITION (partitionKey='2018-11-28') location 'hdfs://127.0.0.1:33799/tmp_parquet/partitionKey=2018-11-28/';
No rows affected (0.155 seconds)
1: jdbc:hive2://localhost:40327/default> show partitions parquet_table_name ;
No rows affected (0.327 seconds)
+--------------------------+
|        partition         |
+--------------------------+
| partitionkey=2018-11-28  |
+--------------------------+
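Typing an ALTER TABLE for every partition gets old quickly, so the statements can be generated from the partition directory names instead. A sketch, assuming the standard key=value directory layout that Spark's partitionBy produces; the table name and base HDFS path below are just this test's values, and the helper name is my own:

```python
# Generate Hive ADD PARTITION statements from Spark's key=value
# partition directory names. A sketch: assumes a single partition
# column and the standard Spark directory layout.

def add_partition_sql(table, base_location, partition_dirs):
    statements = []
    for d in partition_dirs:
        # "partitionKey=2018-11-28/" -> ("partitionKey", "2018-11-28")
        key, value = d.rstrip("/").split("=", 1)
        statements.append(
            f"ALTER TABLE {table} ADD PARTITION ({key}='{value}') "
            f"LOCATION '{base_location}/{d}';"
        )
    return statements

dirs = ["partitionKey=2018-11-28/"]
for sql in add_partition_sql("parquet_table_name",
                             "hdfs://127.0.0.1:33799/tmp_parquet",
                             dirs):
    print(sql)
```

In the real test this would be driven by an HDFS directory listing rather than a hard-coded list.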

Contrary to the advice I read elsewhere, MSCK REPAIR TABLE parquet_table_name SYNC PARTITIONS; did not help me, seemingly because of the camel-case partition name. The other commonly suggested command, ALTER TABLE table_name RECOVER PARTITIONS;, appears to exist only in Amazon's version of Hive.

Now, counting the rows gives me (some of) my data:

1: jdbc:hive2://localhost:40327/default> select count(*) from parquet_table_name;
+-------+
|  _c0  |
+-------+
| 2879  |
+-------+
1 row selected (1.939 seconds)

I'm not a Hive expert, so there may be a solution that just picks up anything in the directory. Despite all the advice on the forums, I could not get that to work: the name of the column on which the data is partitioned apparently needs to be lowercase, and mine was camel case. However, this hack works for now.
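My working theory on the camel-case problem: Hive stores partition column names lowercased in the metastore (note how show partitions printed partitionkey=2018-11-28 even though I typed partitionKey), while Spark wrote the directory with the original casing, so a case-sensitive comparison of directory name against column name finds nothing to repair. The sketch below illustrates that mismatch; it is my reading of the behaviour, not Hive's actual code:

```python
# Illustrates the suspected case mismatch: the metastore holds the
# partition column lowercased, Spark writes the directory with the
# original camel-case name. A case-sensitive comparison fails where
# a case-insensitive one would succeed. Not Hive's actual code.

def matches(directory, metastore_column, case_sensitive=True):
    dir_column = directory.split("=", 1)[0]
    if case_sensitive:
        return dir_column == metastore_column
    return dir_column.lower() == metastore_column.lower()

spark_dir = "partitionKey=2018-11-28"   # what Spark wrote to HDFS
hive_column = "partitionkey"            # what the metastore stores

print(matches(spark_dir, hive_column))                        # False
print(matches(spark_dir, hive_column, case_sensitive=False))  # True
```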

There appears to be a JIRA to make this automatic, but nothing exists at the moment.
