Friday, April 12, 2019

Zeppelin 1


Introduction

When I'd tinkered with Zeppelin in the past, it seemed a bit buggy, but using zeppelin-0.8.1-bin-netinst has been reasonably straightforward.

The reason we want to get it into my current project is that the analysts want to play with and visualise the data without getting too close to Spark.

Configuring

I had to add the dependency:

org.apache.commons:commons-lang3:3.5

to the Spark interpreter (see StackOverflow) because I was getting:

Caused by: java.io.InvalidClassException: org.apache.commons.lang3.time.FastDateParser; local class incompatible: stream classdesc serialVersionUID = 2, local class serialVersionUID = 3

when trying to call Spark. I also added the properties:

HADOOP_CONF_DIR
HADOOP_HOME   
SPARK_HOME

to the interpreter config in the web admin GUI and

export MASTER=yarn-client
export HADOOP_CONF_DIR=/media/sdb8/Tools/Cluster/Hadoop/Current/etc/hadoop/
export SPARK_HOME=/media/sdb8/Tools/Cluster/Spark/Current/

to ZEPPELIN_HOME/conf/zeppelin-env.sh.

I also set up Hadoop again (following my own advice here) but this time I unexpectedly had to tweak the config, as my secondary name node was trying to contact 0:0:0:0. Defining dfs.namenode.secondary.http-address in hdfs-site.xml (see StackOverflow) made the problem go away.
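For reference, that boils down to adding something like the following to hdfs-site.xml (the hostname and port below are just examples, not values from my cluster):

<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>namenode-host:50090</value>
</property>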

Running

Gratifyingly, it worked.

You can add your own code to the classpath by setting spark.jars to your uber-jar (see the documentation).
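In the Spark interpreter settings that's just another property; the value is wherever your build puts the assembly, something like (the path is only an example):

spark.jars    /path/to/my-uber-jar.jar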

Lo and behold, we can now use Zeppelin to ask Spark to calculate trigrams on a corpus using code I'd written elsewhere.
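By way of illustration (this isn't the code from that notebook, just a minimal PySpark sketch of the same sort of thing), word trigrams over a toy corpus can be counted with the ml feature transformers:

%pyspark
from pyspark.ml.feature import Tokenizer, NGram
from pyspark.sql import functions as F

# 'spark' is the session Zeppelin provides; the corpus here is a single toy sentence
df = spark.createDataFrame([("the cat sat on the mat the cat sat",)], ["text"])

tokens   = Tokenizer(inputCol="text", outputCol="words").transform(df)
trigrams = NGram(n=3, inputCol="words", outputCol="trigrams").transform(tokens)

(trigrams
   .select(F.explode("trigrams").alias("trigram"))
   .groupBy("trigram")
   .count()
   .orderBy(F.desc("count"))
   .show(truncate=False))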


Note: to use PySpark with all the libraries you've installed with pip, you'll want to set the zeppelin.pyspark.python property in the Spark interpreter (not the zeppelin.python property in the Python interpreter) and un-tick zeppelin.pyspark.useIPython. Otherwise, all those Python libraries you've installed won't be found.
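Concretely, that means something like this in the Spark interpreter settings (the path is just an example; point it at whichever Python has your pip-installed packages):

zeppelin.pyspark.python        /path/to/my/virtualenv/bin/python
zeppelin.pyspark.useIPython    false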

Addendum (23/5/2019)

After a few weeks of playing with Zeppelin, I'm still pretty impressed despite a few glitches in the GUI. However, it's still annoying that I cannot push/pull to a Git repository. And I got bitten by this (StackOverflow) too.

It's also worth mentioning that if you're running Zeppelin on a headless server (like AWS), you will need to follow the instructions here if you want some of Python's wonderful graphics libraries.

Thursday, April 4, 2019

Master of your domain name


Introduction

I've had moderate success using neural nets to distinguish between good domain names and domain names generated by an algorithm, something you often find in malware. However:

  1. Tuning neural nets is hard and requires esoteric knowledge.
  2. They're computationally expensive both to train (OK) and to use (not OK).
  3. The results I have so far are at best reasonable and no better.

So, using the same raw data I used in my neural-net work, I tried to do better.

Jensen-Shannon, bigram population distribution

I used the SMaths library to compute Jensen-Shannon scores comparing the character distribution of each good and bad domain name against the distribution over all good names, using 1-hot character encoding. Unfortunately, the results are poor:

JS scores for good (green) and bad (red) domain names
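The calculation itself is simple enough; a minimal numpy sketch of the same idea (not the SMaths code) looks something like this:

import numpy as np

def kl(p, q):
    # Kullback-Leibler divergence in bits, skipping terms where p is zero
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def jensen_shannon(p, q):
    # Symmetrised divergence between two (unnormalised) frequency vectors
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# e.g. character frequencies for one domain name vs. frequencies over all good names
p = np.array([3.0, 1.0, 0.0, 2.0])
q = np.array([2.0, 2.0, 1.0, 1.0])
print(jensen_shannon(p, q))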

Just Shannon Entropy

What do you do when the probability for an n-gram is zero? The common answer is "ignore the zero probabilities, and carry on summation using the same equation" (StackExchange).

But I found that penalising zeros gave me much more power to differentiate the two categories. The value of this penalty was derived empirically; roughly optimising it with a crude binary search gave me a much better spread of the distributions:

Penalized Shannon entropy scores for good (green) and bad (red) domain names
and consequently a great ROC curve.
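What I mean by penalising zeros is roughly this (a Python sketch of the idea, not the production code; the penalty value and n-gram probabilities below are placeholders, not the ones derived from my training data):

import math

ZERO_PENALTY = -0.1   # placeholder; the real value came from a crude binary search

def penalised_entropy(domain, good_ngram_probs, n=2):
    # Sum p * log2(p) over the domain's n-grams, with probabilities taken from the
    # corpus of good names; n-grams never seen in that corpus attract a fixed
    # penalty rather than being silently dropped.
    score = 0.0
    for i in range(len(domain) - n + 1):
        p = good_ngram_probs.get(domain[i:i + n], 0.0)
        score += p * math.log2(p) if p > 0 else ZERO_PENALTY
    return score

good_ngram_probs = {"go": 0.02, "oo": 0.015, "og": 0.01, "gl": 0.008, "le": 0.02}
print(penalised_entropy("google", good_ngram_probs))   # all bigrams seen before
print(penalised_entropy("xqzvkw", good_ngram_probs))   # every bigram penalised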

The Data

There is, however, a sting in the tail. Some of the good domain names appear to be bad!

Let's look at that ROC curve in 3d:
The same ROC curve in 3d showing the threshold's relationship with the curve
We estimate that the best value for our threshold is roughly -0.018, maximising the number of bad domains we catch while minimising the disruption from good domains that just look a bit suspicious.
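One way to pick such a threshold programmatically (rather than by eyeballing the curve) is Youden's J statistic; a small scikit-learn sketch with made-up labels and scores, oriented so that bigger means more suspicious:

import numpy as np
from sklearn.metrics import roc_curve

# y_true: 1 for known-bad domains, 0 for known-good (illustrative data only)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
score  = np.array([0.1, 0.3, 0.2, 0.4, 0.7, 0.6, 0.9, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, score)
j = tpr - fpr                        # Youden's J: maximise catches, minimise disruption
print("best threshold:", thresholds[np.argmax(j)])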

Looking at the false negatives was illuminating. Here is a small sample:

dynserv
bowupem
ikrginalcentricem
gentlemanwritten
qmigfordlinnetavox
jfbjinalcentricem
osbumen
vxheellefrictionlessv

These show a level of sophistication greater than typical DGA output. They contain normal (or normal-sounding) words. Words like frictionless appear in a number of these domains, adding an air of authenticity. Indeed, gentlemanwritten.net sounds positively respectable.

The false positives were even more interesting. Here is a very small selection:

95a49f09385f5fb73aa3d1e994314a45b8d51f17
mhtjwmxf
wlhzfpgs
kztudyya

These look awfully suspicious. In fact, attempts to look them up with whois reveal nothing at all, which is odd.

Conclusion

Although we're getting better results with a less computationally expensive solution than neural nets, we always assumed that the training data was clean. In fact, it appears to be reasonably clean but not pristine. On the upside, this means that if we clean the data we can reasonably expect our false positive rate to go down even further.