Introduction
When I'd tinkered with Zeppelin in the past, it seemed a bit buggy, but using zeppelin-0.8.1-bin-netinst has been reasonably straightforward.
The reason we want to bring it into my current project is that the analysts want to play with and visualise the data without getting too close to Spark.
Configuring
I had to add the dependency:
org.apache.commons:commons-lang3:3.5
to the Spark interpreter (see StackOverflow) because I was getting:
Caused by: java.io.InvalidClassException: org.apache.commons.lang3.time.FastDateParser; local class incompatible: stream classdesc serialVersionUID = 2, local class serialVersionUID = 3
when trying to call Spark. I also added the properties:
HADOOP_CONF_DIR
HADOOP_HOME
SPARK_HOME
to the interpreter config in the web admin GUI and
export MASTER=yarn-client
export HADOOP_CONF_DIR=/media/sdb8/Tools/Cluster/Hadoop/Current/etc/hadoop/
export SPARK_HOME=/media/sdb8/Tools/Cluster/Spark/Current/
to ZEPPELIN_HOME/conf/zeppelin-env.sh.
I also set up Hadoop again (following my own advice here), but this time I unexpectedly had to tweak the config, as my secondary namenode was trying to contact 0.0.0.0. Defining dfs.namenode.secondary.http-address in hdfs-site.xml (see StackOverflow) made the problem go away.
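As a sketch, the extra hdfs-site.xml entry looks something like this (the host and port here are placeholders; 50090 is the Hadoop 2.x default secondary namenode HTTP port, and you'd substitute the host your cluster actually uses):

<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>localhost:50090</value>
</property>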
Running
Gratifyingly, it all worked. You can add your own code to the classpath by setting spark.jars to your uber-jar (see the documentation).
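In the Spark interpreter properties that's a single entry (the jar path below is just a placeholder for wherever your build puts the assembly):

spark.jars = /path/to/your-uber-jar.jar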
Lo and behold, we can now use Zeppelin to ask Spark to calculate trigrams on a corpus using code I'd written elsewhere:
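That code isn't reproduced here, but a minimal PySpark sketch of the same idea, using Spark ML's NGram transformer on a hypothetical corpus.txt, would look like this (in Zeppelin, spark is provided for you in a %pyspark paragraph):

%pyspark
from pyspark.sql.functions import col, lower, split
from pyspark.ml.feature import NGram

# One row per line of the corpus; tokenise naively on whitespace.
lines = spark.read.text("corpus.txt")  # hypothetical input file
tokens = lines.select(split(lower(col("value")), "\\s+").alias("words"))

# NGram emits each run of three consecutive tokens as a space-joined string.
trigrams = NGram(n=3, inputCol="words", outputCol="trigrams").transform(tokens)
trigrams.select("trigrams").show(truncate=False)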
Note that to then use PySpark with all the libraries you've installed with pip, you'll want to set the zeppelin.pyspark.python property in the Spark interpreter (not the zeppelin.python property in the Python interpreter) and un-tick zeppelin.pyspark.useIPython. Otherwise, none of those pip-installed libraries will be found.
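In the Spark interpreter settings that looks something like this (the interpreter path is an assumption; point it at whichever Python actually has your pip packages):

zeppelin.pyspark.python = /path/to/your/venv/bin/python
zeppelin.pyspark.useIPython = false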
Addendum (23/5/2019)
After a few weeks of playing with Zeppelin, I'm still pretty impressed despite a few glitches in the GUI. However, it's still annoying that I cannot push/pull to a Git repository. And I got bitten by this (StackOverflow) too.
It's also worth mentioning that if you're running Zeppelin on a headless server (like AWS), you will need to follow the instructions here if you want to use some of Python's wonderful graphics libraries.
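I won't repeat those instructions, but assuming they cover the usual headless-matplotlib problem, the heart of the fix is selecting a non-interactive backend before pyplot is imported:

import matplotlib
matplotlib.use("Agg")  # render to buffers/files; no X display required
import matplotlib.pyplot as plt

plt.plot([1, 2, 3], [1, 4, 9])
plt.savefig("plot.png")  # write to disk rather than popping up a window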