Saturday, February 1, 2020

Kafka, Spark and HDFS in Docker on one Laptop


Starting Spark, HDFS and Kafka all in a Docker-ised environment is very convenient but not without its niggles. Here's what I did to run a Spark Structured Streaming app on my laptop.

Start a Kafka/ZooKeeper cluster in Docker by following this link [GitHub] and, for Spark/HDFS, try here [GitHub]. Note that in the Kafka/ZooKeeper config you will have to change the value of KAFKA_ADVERTISED_HOST_NAME in docker-compose.yml to your machine's IP address each time you fire it up.
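
For example, on a Linux host something like this finds the address to use (10.107.222.63 happens to be mine):

$ hostname -I | awk '{print $1}'
10.107.222.63

and, after editing, the relevant line of docker-compose.yml should look something like:

KAFKA_ADVERTISED_HOST_NAME: 10.107.222.63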

Note that Docker creates a virtual network. You can manually attach containers to virtual networks [SO] but you don't need to worry about that if you use docker-compose up -d. The -d switch detaches the containers so their output isn't spewed to stdout.
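
So bringing each stack up from its project directory is just:

$ docker-compose up -d
$ docker-compose ps

where docker-compose ps is merely a sanity check that the containers actually came up.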

You can test Zookeeper is indeed up and running with:

$ echo ruok | nc localhost 2181
imok$

Note that on the host machine, you can run:

$KAFKA_HOME/bin/kafka-topics.sh --list --zookeeper localhost:2181
__consumer_offsets
test_topic

And see that the internal topic __consumer_offsets has automatically been created to store consumer offsets. The topic test_topic is something I created that we'll need later.
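
For what it's worth, creating test_topic is just a case of something like this (one partition and a replication factor of 1 are purely illustrative):

$KAFKA_HOME/bin/kafka-topics.sh --create --zookeeper localhost:2181 --partitions 1 --replication-factor 1 --topic test_topic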

We can even jump onto the Kafka container and watch the logs:

$ docker exec -it kafkadocker_kafka_1 bash
bash-4.4# tail -f /opt/kafka_2.12-2.4.0/logs/server.log

Let's now see who has the Kafka port:

$ docker ps | grep 9092
02fc5122f6e2        kafkadocker_kafka                                "start-kafka.sh"         9 minutes ago       Up 9 minutes             0.0.0.0:32770->9092/tcp                                    kafkadocker_kafka_1

Note that this means the Kafka broker is reachable from the host OS on port 32770, not 9092 (Docker has mapped the container's 9092 onto a random high port on the host). We'll need this to tell Spark where Kafka is.
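
As a quick smoke test from the host OS that the broker really is reachable on that mapped port (the number will almost certainly differ on your machine), try something like:

$KAFKA_HOME/bin/kafka-topics.sh --list --bootstrap-server localhost:32770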

Note that the Spark worker has just 1GB of heap (see the -Xmx1g buried in its command line below):

docker exec -it dockerhadoopsparkworkbench_spark-worker_1 bash
root@a68aff72a10f:/# jps
229 Worker
1178 Jps
root@a68aff72a10f:/# cat /proc/229/cmdline
/docker-java-home/bin/java-cp/spark//conf/:/spark/jars/*:/etc/hadoop/:/opt/hadoop-2.8.0/share/hadoop/common/lib/*:/opt/hadoop-2.8.0/share/hadoop/common/*:/opt/hadoop-2.8.0/share/hadoop/hdfs/:/opt/hadoop-2.8.0/share/hadoop/hdfs/lib/*:/opt/hadoop-2.8.0/share/hadoop/hdfs/*:/opt/hadoop-2.8.0/share/hadoop/yarn/lib/*:/opt/hadoop-2.8.0/share/hadoop/yarn/*:/opt/hadoop-2.8.0/share/hadoop/mapreduce/lib/*:/opt/hadoop-2.8.0/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar-Xmx1gorg.apache.spark.deploy.worker.Worker--webui-port8081spark://spark-master:7077root@a68aff72a10f:/#
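
If 1GB proves too small, Spark's standalone worker honours the SPARK_WORKER_MEMORY environment variable, so (assuming the bde2020 image passes its environment through to the worker, which I haven't verified) adding something like this to the spark-worker service in docker-compose.yml should give it more headroom:

    environment:
      - SPARK_WORKER_MEMORY=2g   # hypothetical value; assumes the image forwards this to the worker JVM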

If you want to execute a Spark shell, run:

$ docker exec -it spark-master /bin/bash ./spark/bin/spark-shell --master spark://spark-master:7077

Or, if you want to include a JAR:

docker run --rm -it --network dockerhadoopsparkworkbench_default --env-file ./hadoop.env -e SPARK_MASTER=spark://spark-master:7077 --volume  /home/henryp/Code/Scala/MyCode/SSSPlayground/target/:/example bde2020/spark-base:2.4.0-hadoop2.8-scala2.12 /spark/bin/spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.12:2.4.0  --jars /example/SSSPlayground-1.0-SNAPSHOT-jar-with-dependencies.jar --master spark://spark-master:7077

Docker Compose creates a virtual network for each project and you can see them all with:

$ docker network ls
NETWORK ID          NAME                                 DRIVER              SCOPE
2f804fb10173        bridge                               bridge              local
caeab723a6c7        dockerhadoopsparkworkbench_default   bridge              local
dbb8f4df303a        host                                 host                local
b8ba799f4916        kafkadocker_default                  bridge              local
a15ee4bf8c1f        kafkasparkhadoopzk_default           bridge              local
d796a747993d        none                                 null                local
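
To see exactly which containers are attached to a given network, docker network inspect does the job (the Containers section of its JSON output lists them):

$ docker network inspect dockerhadoopsparkworkbench_default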

The bridge network is the default, but we'll use the network in which the Spark containers sit when we deploy the local uber-jar with:

docker run --rm -it --network dockerhadoopsparkworkbench_default --env-file ./hadoop.env -e SPARK_MASTER=spark://spark-master:7077 --volume  /home/henryp/Code/Scala/MyCode/SSSPlayground/target/:/example bde2020/spark-base:2.4.0-hadoop2.8-scala2.12 /spark/bin/spark-submit --class=uk.co.odinconsultants.sssplayground.windows.ConsumeKafkaMain --master spark://spark-master:7077  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:2.4.0  /example/SSSPlayground-1.0-SNAPSHOT-jar-with-dependencies.jar 10.107.222.63:32770 test_topic /streaming_test 600000

using paths relevant to you. This creates another container (from the spark-base image) that lives just for the duration of the app. For what it's worth, the code lives here [GitHub].

This command line takes some explaining:

  • We're getting Docker to mount a local directory (the one containing the uber-jar) with the --volume switch.
  • To have Spark stream from Kafka, you need the dependency defined with the --packages switch. This has nothing to do with Docker but is essential for Spark and Kafka to talk to each other.
  • The address, 10.107.222.63, is my host OS's IP.
  • Finally, we need to tell Spark where the Kafka bootstrap servers are, and that is the port 32770 we saw earlier.
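
Let's now pump some messages through Kafka from the host OS. A crude way to do it (the console producer turns each line of stdin into a message; the port is the mapped one from earlier) is something like:

$ seq 1000000 | $KAFKA_HOME/bin/kafka-console-producer.sh --broker-list localhost:32770 --topic test_topic
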
Doing this leads to data accumulating in HDFS, which we can see with:

$ docker exec -it dockerhadoopsparkworkbench_datanode_1  hadoop fs -du -h /
946.7 M  /streaming_test
1.9 K    /streaming_testcheckpoint
47.7 K   /tmp
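
To look at the individual files the streaming job is writing rather than just the totals, something like this works:

$ docker exec -it dockerhadoopsparkworkbench_datanode_1 hadoop fs -ls /streaming_test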

So, evidently things are working.
