Friday, March 23, 2018

Hands on with Gradient Boosting


Boosting is a meta-algorithm, that is, "an algorithm that exists to manipulate some other algorithm".

It goes like this (from [1]; a sketch in code follows the list):

  1. Draw a random subset of training samples d1 without replacement from the training set D and use it to train a weak learner C1.
  2. Draw a second random training subset d2 without replacement from D, add 50% of the samples that C1 misclassified, and use this to train a weak learner C2.
  3. Find the training samples d3 in D on which C1 and C2 disagree and use them to train a third weak learner C3.
  4. Combine the predictions of C1, C2 and C3 by majority voting.
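
For concreteness, here is a minimal sketch of that scheme in plain Scala. The Sample type, the weak-learner trainer passed in as a function and the 50% subset sizes are illustrative assumptions, not any particular library's API:

import scala.util.Random

case class Sample(features: Vector[Double], label: Int)

trait Classifier { def predict(x: Vector[Double]): Int }

def boostThree(d: Seq[Sample],
               trainWeakLearner: Seq[Sample] => Classifier,
               rng: Random): Classifier = {
  // 1. Train C1 on a random subset d1 drawn without replacement from D.
  val d1 = rng.shuffle(d).take(d.size / 2)
  val c1 = trainWeakLearner(d1)

  // 2. Train C2 on a second subset plus half of the samples C1 misclassified.
  val misclassified = d.filter(s => c1.predict(s.features) != s.label)
  val d2 = rng.shuffle(d).take(d.size / 2) ++
           rng.shuffle(misclassified).take(misclassified.size / 2)
  val c2 = trainWeakLearner(d2)

  // 3. Train C3 on the samples where C1 and C2 disagree.
  val d3 = d.filter(s => c1.predict(s.features) != c2.predict(s.features))
  val c3 = trainWeakLearner(d3)

  // 4. Combine C1, C2 and C3 by majority vote.
  new Classifier {
    def predict(x: Vector[Double]): Int =
      Seq(c1, c2, c3).map(_.predict(x)).groupBy(identity).maxBy(_._2.size)._1
  }
}
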
Bagging vs. Boosting

"Bagging is a simple ensembling technique in which we build many independent predictors/models/learners and combine them using some model averaging techniques. (e.g. weighted average, majority vote or normal average)...

"Boosting is an ensemble technique in which the predictors are not made independently, but sequentially." (from Prince Grover)

So, as a mnemonic, think booSting is Sequential and bAgging is pArallel.
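
To make the distinction concrete in code, here is a small sketch in Spark ML (Spark features elsewhere on this blog) rather than Knime; the training DataFrame with "label" and "features" columns is an assumption for illustration:

import org.apache.spark.ml.classification.{GBTClassifier, RandomForestClassifier}

// bAgging: a random forest's trees are trained independently of one another.
val bagged = new RandomForestClassifier().setNumTrees(100)

// booSting: gradient-boosted trees are trained sequentially, each tree fitted
// to the errors of the ensemble built so far.
val boosted = new GBTClassifier().setMaxIter(100)

val baggedModel = bagged.fit(training)
val boostedModel = boosted.fit(training)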

Any time, any place, anywhere

Because boosting is a meta-algorithm, it can be used with many classifiers in Knime.

For instance, I used Knime's Palladian nodes to classify the "20 Newsgroups" data set. This algorithm extensively uses n-grams. With a min/max n-gram size of 3/10, it gave an overall accuracy of 88.3% in classification.

Although Palladian's classifier nodes are being used here, they could be replaced by any classifier, such as Naive Bayes.
So, I boosted the results 10 times... And only got 85.9%. Boosting 100 times gave 78.5%. Hmm, what gives?

This StackOverflow post gives a hint. "In general, boosting error can increase with the number of iterations, specifically when the data is noisy (e.g. mislabeled cases)... Basically, boosting can 'focus' on correctly predicting cases that contain misinformation, and in the process, deteriorate the average performance on other cases that are more substantive."

The post also has an interesting chat between data scientists about boosting and overfitting. "I think it's interesting that you've rarely seen gradient boosting overfit. Over the four or so years that I've been using it, I've seen the opposite -- too many trees leads to overfitting".

Indeed, Raschka writes that boosting algorithms are known for the "tendency to overfit the training data".

Features! Features! Features!

What seemed to have the most impact on Palladian's algorithm was the choice of min/max n-gram sizes. A value of 15/15 gave only 67% accuracy, comparing poorly with the 88% achieved with 3/10.

[1] Python Machine Learning, Sebastian Raschka

Tuesday, March 20, 2018

Fighting Orcs


Orc has an amazing ability to compress data. A 1.1TB Parquet file shrank to 15.7GB when saved as Orc, although note that the Orc file is also compressed: "The default compression format for ORC is set to snappy" (StackOverflow). This does indeed seem to be the case, as you can see snappy in the names of the files in the HDFS directory.
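
If you want to choose the codec explicitly rather than rely on the default, the Orc writer takes a compression option (parquetDf and ORC_DIR_NAME are placeholder names; "snappy", "zlib" and "none" are the usual values):

parquetDf.write.option("compression", "snappy").orc(ORC_DIR_NAME)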

But how fast is querying it?

Well, there was an initial pause when first reading the file with:

val orcFile = spark.read.orc(HDFS_DIR_NAME)

It seemed that the driver was doing all the work and all 60 executors I had asked my shell to employ were doing nothing. Using jstack showed that the driver was trying to infer the schema (see o.a.s.s.e.d.DataSource.getOrInferFileFormatSchema). Adding the schema would mitigate this.
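
For example, something like this should skip the inference step entirely (the column types shown are assumptions; the real file has 292 columns):

import org.apache.spark.sql.types._

// With an explicit schema the driver no longer has to scan the files to infer
// it before the executors get any work.
val schema = StructType(Seq(
  StructField("dax", IntegerType),
  StructField("mandant", IntegerType)
  // ... and the remaining columns
))
val orcFile = spark.read.schema(schema).orc(HDFS_DIR_NAME)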

Now, let's group by a field called dax.

orcFile.groupBy('dax).agg(count('dax))

(There are 36 different values for dax over nearly 890 million rows with 292 columns).

This groupBy took about 50s when the underlying format was Parquet and 37s when it was Orc (not including the perfectly avoidable large read time) so not bad. How does Orc do it? It has 'indexes'. Sort of. "The term 'index' is rather inappropriate. Basically it's just min/max information persisted in the stripe footer at write time, then used at read time for skipping all stripes that are clearly not meeting the WHERE requirements, drastically reducing I/O in some cases" (StackOverflow).

In fact, you can see this metadata using the Hive command line:

hive --orcfiledump HDFS_DIR_NAME

Can we make it even faster? I thought sorting and repartitioning might make a difference so I executed this:

orcFile.sort("dax").repartition(200).write.partitionBy("dax").mode(SaveMode.Overwrite).orc(HDFS_DIR)

Note, caching the original DataFrame would cause OOMEs for reasons unknown to me at the moment. And without repartitioning, the third (of four) stages would take so long I had to kill the job. Repartitioning helped but I did notice a huge amount of shuffle (hundreds of GB when the original file itself was only 16GB).

Also, I had to keep increasing the driver's memory until I stopped getting OOMEs in the last stage. For reasons that I don't currently understand, the memory had to increase to 20GB before the job finished. Even then, it took 2.5 hours on 30 executors with 5 cores each and the resulting directory took 146GB of disk space.

However, the speed of a query was staggeringly fast. This:

partitionedOrc.where('dax === 201501).groupBy('dax).agg(count('dax)).show(10)

took a mere 2s. By comparison, the same query on the unpartitioned and unsorted Orc file took about 32s.

This speed is comparable to Apache Impala which "can be awesome for small ad-hoc queries" (StackOverflow).

Note that Impala, being "highly memory intensive (MPP), it is not a good fit for tasks that require heavy data operations like joins etc., as you just can't fit everything into the memory. This is where Hive is a better fit" (ibid).

Note that this speed increase appears to be available only when filtering on the partition column itself. For instance, let's take another field, mandant, and run a slightly modified version of the query above:

partitionedOrc.where('mandant === 201501).groupBy('dax).agg(count('dax)).show(10)

This takes 32s as well.
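
One way to see the difference is to compare the physical plans (a sketch, assuming the partitionedOrc DataFrame from above): a predicate on the partition column should show up as a partition filter, so whole directories are skipped, whereas the mandant predicate is only an ordinary filter that still requires scanning every partition.

// The plan here should list the dax predicate under the scan's partition filters...
partitionedOrc.where('dax === 201501).explain()

// ...while here mandant should appear only as a regular (pushed/post-scan) filter.
partitionedOrc.where('mandant === 201501).explain()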

Monday, March 5, 2018

DeepLearning4J in a Spark cluster


Running DL4J on Spark was not too hard but there were some non-obvious gotchas.

Firstly, my tests ran on a Windows machine but I needed to add dependencies to get it to run on our Linux cluster where I was getting "no openblas in java.library.path" errors. This link helped me and now my dependencies looked like this:

    <dependency>
      <groupId>org.deeplearning4j</groupId>
      <artifactId>dl4j-spark_${scala.compat.version}</artifactId>
      <version>${deeplearning4j.version}_spark_2</version>
    </dependency>
    <dependency>
      <groupId>org.deeplearning4j</groupId>
      <artifactId>deeplearning4j-core</artifactId>
      <version>${deeplearning4j.version}</version>
    </dependency>
    <dependency> <!-- remember to run spark with conf "spark.kryo.registrator=org.nd4j.Nd4jRegistrator" -->
      <groupId>org.nd4j</groupId>
      <artifactId>nd4j-kryo_${scala.compat.version}</artifactId>
      <version>${nd4j.version}</version>
    </dependency>
    <dependency>
      <groupId>org.nd4j</groupId>
      <artifactId>nd4j-native-platform</artifactId>
      <version>${nd4j.version}</version>
    </dependency>
    <dependency>
      <groupId>org.nd4j</groupId>
      <artifactId>nd4j-native</artifactId>
      <version>${nd4j.version}</version>
    </dependency>

with

    <deeplearning4j.version>0.9.1</deeplearning4j.version>
    <nd4j.version>0.9.1</nd4j.version>

Secondly, you need to add --conf "spark.kryo.registrator=org.nd4j.Nd4jRegistrator" to the CLI when you start a Spark shell.
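
For example (the KryoSerializer setting is assumed here, since the registrator only takes effect when Kryo serialization is enabled):

spark-shell --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
            --conf "spark.kryo.registrator=org.nd4j.Nd4jRegistrator"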

Thirdly, I randomly grabbed a recurrent neural network from here just to test my code. I found that it was immensely memory-hungry: I needed to give my driver 20GB, my executors 30GB and only one core per executor to avoid occasional errors in the Spark stages.

Even then, I couldn't measure the accuracy of my neural net because of this issue. Apparently, it's fixed in the SNAPSHOT but then there are issues with the platform JARs not being up to date. I asked about this on the DeepLearning4J gitter channel archived here. (The team also helpfully told me to use org.deeplearning4j.eval.Evaluation with an argument to avoid the bug).

Finally, I started getting results, but not before one last gotcha, this time in Spark: use RDD.sample rather than take to get your hands on test data, as you want a nice distribution over all the categories. With this, I started getting more sensible answers when evaluating my results.
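
As a sketch of the difference (the RDD name, fraction and seed are placeholders):

// take() just grabs the first n elements, which for partitioned or sorted data
// can easily all come from the same category...
val firstFew = testRdd.take(1000)

// ...whereas sample() draws from across the whole RDD, giving a far more
// representative spread over the categories.
val sampled = testRdd.sample(withReplacement = false, fraction = 0.01, seed = 42L)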