We wanted full control of what was running in our cluster, so we installed our own stack rather than use Cloudera or Hortonworks. These are some things I learned along the way.
HDFS
Getting a simple distributed filesystem up and running was relatively straightforward. This StackOverflow answer and this one give the minimum work required.
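As a rough sketch of that minimum (NAMENODE_HOST, the port and the paths are placeholders; adjust to your own layout), you point fs.defaultFS at the NameNode in core-site.xml on every node, list the worker hostnames in $HADOOP_HOME/etc/hadoop/workers (called slaves on Hadoop 2.x), then format and start the filesystem:

<!-- core-site.xml, on every node -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://NAMENODE_HOST:9000</value>
</property>

# on the NameNode, once only
$HADOOP_HOME/bin/hdfs namenode -format
# start the NameNode plus the DataNodes listed in the workers file
$HADOOP_HOME/sbin/start-dfs.sh
# sanity check: every DataNode should appear in the report
$HADOOP_HOME/bin/hdfs dfsadmin -report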
Spark
Spark doesn't need Hadoop to run. So, as the simplest thing that works, I started a standalone Spark cluster by following the instructions at DataWookie. This made running a Spark shell against the cluster as simple as adding the switch --master spark://SPARK_MASTER_HOST:7077.
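Roughly, that standalone setup looks like the following (SPARK_MASTER_HOST is a placeholder, 7077 is the default master port, and start-worker.sh is called start-slave.sh on older Spark versions):

# on the master node
$SPARK_HOME/sbin/start-master.sh
# on each worker node, pointing it at the master
$SPARK_HOME/sbin/start-worker.sh spark://SPARK_MASTER_HOST:7077
# then a shell against the cluster
$SPARK_HOME/bin/spark-shell --master spark://SPARK_MASTER_HOST:7077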
Although simple, this configuration brought down my cluster when I tried something computationally expensive (my Linux boxes died with oom_reaper messages in /var/log/messages).
So, I did the sensible thing and ran it under YARN with the --master yarn switch.
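With HADOOP_CONF_DIR pointing at your Hadoop configuration so Spark can find the ResourceManager, that is just:

# Spark reads the ResourceManager address from the Hadoop config
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
$SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client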
YARN
We didn't find YARN needed much additional configuration. The defaults were sufficient with the exception of:
- yarn.nodemanager.vmem-check-enabled should be set to false as YARN kept insisting my Spark jobs did not have enough virtual memory (see StackOverflow).
- yarn.resourcemanager.hostname should be set to the machine where YARN's ResourceManager runs. Failing to do this will likely leave HDFS running on the other nodes in the cluster but no YARN jobs running on them (see StackOverflow).
- yarn.nodemanager.resource.memory-mb should be set to a high percentage of each node's total memory (StackOverflow). Note that the default is a measly 8GB, so your Spark jobs will silently use few resources if you don't change this.
These are all set in yarn-site.xml. The only other file I had to configure was capacity-scheduler.xml, as I was the only person using the cluster and wanted to hog its resources (in particular, yarn.scheduler.capacity.maximum-am-resource-percent). Both files are sketched below.
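Put together, the relevant entries look something like this; the memory and scheduler figures are only examples (the scheduler default is 0.1), so size them for your own nodes:

<!-- yarn-site.xml -->
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>RESOURCE_MANAGER_HOST</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <!-- example: 56GB of a 64GB node -->
  <value>57344</value>
</property>

<!-- capacity-scheduler.xml -->
<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <!-- let ApplicationMasters take most of the queue -->
  <value>0.8</value>
</property>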
Please do remember to copy the $HADOOP_HOME/etc directory onto all nodes in the cluster when you make changes, and restart the cluster for good measure.
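A sketch of pushing the configuration out; the node names are made up and the rsync/ssh details depend on your setup:

# push the Hadoop configuration to every node, then bounce the cluster
for node in node1 node2 node3; do
  rsync -a $HADOOP_HOME/etc/ $node:$HADOOP_HOME/etc/
done
$HADOOP_HOME/sbin/stop-yarn.sh && $HADOOP_HOME/sbin/stop-dfs.sh
$HADOOP_HOME/sbin/start-dfs.sh && $HADOOP_HOME/sbin/start-yarn.sh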
Check that all your nodes are running by executing:
yarn node -list
Hive
Hive was the most annoying to get running, so beware. It appears that in addition to copying the JARs listed in this Apache article on integrating Hive with Spark, you must also copy the spark-yarn and scala-reflect JARs into Hive's lib directory and remove the Hive 1.2.1 JARs from the HDFS directory.
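A sketch of what that amounts to; the exact JAR names depend on your Spark and Scala versions, and /spark-jars is only a stand-in for wherever you uploaded Spark's JARs to HDFS:

# make Spark's YARN and Scala reflection classes visible to Hive
cp $SPARK_HOME/jars/spark-yarn_*.jar $HIVE_HOME/lib/
cp $SPARK_HOME/jars/scala-reflect-*.jar $HIVE_HOME/lib/
# remove the stale Hive 1.2.1 JARs from the HDFS directory of Spark JARs
hdfs dfs -rm /spark-jars/hive-*1.2.1*.jar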
You'll need to configure Hive to talk to the database of your choice so that it can store its metadata. This StackOverflow answer is for MySQL, but the principle was the same when I used Derby.
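For Derby over the network, the metastore connection in hive-site.xml ends up looking something like this (the host, port and database name are placeholders, and you'll also need Derby's derbyclient.jar on Hive's classpath):

<!-- hive-site.xml -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby://DERBY_HOST:1527/metastore_db;create=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.ClientDriver</value>
</property>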
Derby
Start Derby with something like $DERBY_HOME/bin/startNetworkServer -h HOST, where HOST is the address Derby should listen on (the same address you've given Hive in its connection URL).
This Apache resource is good for checking your Derby server is up and running.
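One quick check, assuming Derby's default port of 1527, is the network server's own ping command:

# prints a confirmation if the server is listening, an exception otherwise
$DERBY_HOME/bin/NetworkServerControl ping -h HOST -p 1527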
Kafka
Kafka and ZooKeeper were surprisingly easy to set up, as the Apache Quickstart page is very well written. It just remains to say that Kafka "keys are used to determine the partition within a log" (StackOverflow) and that if you want to clean the logs, this StackOverflow answer may be useful.
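For completeness, the quickstart boils down to something like the following; the topic name is made up, and on older Kafka versions these tools take --zookeeper rather than --bootstrap-server. Shrinking a topic's retention, as in the last command, is one of the clean-up approaches from that answer:

# start ZooKeeper, then a single Kafka broker
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties

# create a topic; each message's key determines which partition it lands in
bin/kafka-topics.sh --create --topic events --partitions 3 --replication-factor 1 --bootstrap-server localhost:9092

# one way to clean a topic's log: temporarily shrink its retention
bin/kafka-configs.sh --alter --entity-type topics --entity-name events --add-config retention.ms=1000 --bootstrap-server localhost:9092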
Check your disk space
Note that the data for old jobs hangs around in $SPARK_HOME/work, so you need to clean it up regularly.
If you don't, jobs just won't start and YARN will complain there are not enough resources without telling you which resources are deficient. That the real culprit is low disk space is far from obvious.
YARN barfs because yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage is set to 90% by default. So, it is not sufficient to have "enough" disk space in absolute terms. See Cloudera for more information.
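To stop $SPARK_HOME/work filling up in the first place, the standalone workers can clean up after themselves via spark-env.sh; the intervals below are just examples:

# spark-env.sh on each worker: purge old application directories
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
  -Dspark.worker.cleanup.interval=1800 \
  -Dspark.worker.cleanup.appDataTtl=604800"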
In addition, you may want to set hadoop.tmp.dir in core-site.xml to a partition with lots of free space or you might see Spark jobs failing with "Check the YARN application logs for more details".
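A sketch of that setting; the path is only an example, pick a partition with room to spare:

<!-- core-site.xml -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadoop-tmp</value>
</property>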
If in doubt...
... check the logs.
Keep checking $HADOOP_HOME/logs, particularly the yarn-*.log files. They'll give an indication of why things are not working although the messages can sometimes be misleading.
The disk space issue above manifested itself while Hive was running, even though it was ultimately down to YARN. Hive would throw timeout exceptions in its log. YARN would say there were not enough resources even though nothing else was running (yarn application -list). Looking at YARN's pending applications, I saw just my job sitting there, not running. The hint was that the equivalent job ran fine when submitted outside YARN, directly to the standalone Spark master.
Hive's logs default to /tmp/HIVE_USER/hive.log and this is the first place to turn when things go wrong.