Monday, June 18, 2018

To Normalize or Not?

I've demonstrated in previous posts that pre-processing our data improved our accuracy more than hyperparameter tuning. But "preprocessing in machine learning is somewhat a very black art" as this StackExchange post says. "It is not written down in papers a lot why several preprocessing steps are essential to make it work... To make things more complicated, it depends heavily on the method you use and also on the problem domain."

All the answers on this page are enlightening: given 4 data points for people's age and height, the clustering differs depending on whether we use metric or imperial measurements, since the distances between the points differ; normalizing tends to make the clusters in K-Means more distinct; and "if you have a neural network and just apply an affine transformation [x' = ax + b] to your data, the network does not lose or gain anything in theory. In practice, however, a neural network works best if the inputs are centered and white." There is also a caveat for regularization: because the penalty term acts on the weights, rescaling the inputs changes what the regularizer effectively penalizes.

I looked into this last point and did some (non-exhaustive) experiments. Results are from Spark's MultilayerPerceptronClassifier using 100 epochs. As ever, results are indicative and not to be taken too seriously (for instance, a difference of a percent may just be noise).

Word Vector | Sentence Vector | Feature Vector | Accuracy (%)

All sentence vectors were constructed from simply adding the relevant word vectors together.

Although I spent less time on TensorFlow, I also noticed that pre-processing was important there too. Using the same architecture, unnormalized data gave me about 50% accuracy while L2 normalized words gave me 93% accuracy. Neither accuracy seemed to be increasing after the 1000 epochs I let them run for.
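As a concrete sketch of the kind of pre-processing involved, here is L2 normalization of word vectors in numpy (the data is made up; this is illustrative, not the code used in the experiments):

```python
import numpy as np

def l2_normalize(vectors, eps=1e-12):
    """Scale each row to unit L2 norm (eps guards against all-zero rows)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, eps)

words = np.array([[3.0, 4.0],
                  [0.0, 2.0]])
unit = l2_normalize(words)
# each row now has length 1: [[0.6, 0.8], [0.0, 1.0]]
```

After this, every word vector lies on the unit sphere, so only direction (not magnitude) carries information into the network.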

Changing tack completely, I also one-hot encoded my words to form sentence vectors and achieved a respectable but not stellar accuracy of 84%.
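A minimal sketch of that construction (the tiny vocabulary here is a stand-in for the real one): summing one-hot word vectors gives a bag-of-words sentence vector.

```python
import numpy as np

vocab = {"spark": 0, "runs": 1, "fast": 2, "slow": 3}  # toy vocabulary

def sentence_vector(words, vocab):
    """Sum of one-hot word vectors, i.e. a bag-of-words count vector."""
    v = np.zeros(len(vocab))
    for w in words:
        if w in vocab:
            v[vocab[w]] += 1.0
    return v

sentence_vector(["spark", "runs", "fast"], vocab)  # -> [1., 1., 1., 0.]
```

Note that, like vector addition, this representation discards word order entirely.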

I thought that perhaps I was helping the neural net by pre-processing but it would still learn anyway if given enough time. This appears not to be true. Taking the data with no processing of the word, sentence and column vectors, the accuracy with increasing number of epochs looked like this:

Number of epochs | Accuracy (%)

This demonstrates a clear plateau in accuracy at about 85% when we know we can achieve an accuracy as high as 95% using the same classifier and the same data (if rendered somewhat differently).

Thursday, June 7, 2018

Spark Query Plans (Gotcha!)

I was exploring why Spark was taking hours to run a support vector machine on my data when other ML algorithms were taking minutes. I called .explain() on my DataFrame but I found surprisingly little information on the web concerning interpreting Spark query plans.

The best source I found was this at Simon Fraser University. The take-away points are:

  • "Read [query plans] from the bottom up."
  • Exchanges, eg "Exchange hashpartitioning(...", are shuffles.
  • InMemoryRelation and InMemoryTableScan "will look in memory for data, calculating and caching if necessary".
  • Range and Project come from select()s, withColumn() etc.

Playing around, I found InMemoryRelation and InMemoryTableScan can be generated by adding a .cache().

scala> val df = sc.range(1, 100000).toDF
scala> df.explain()
== Physical Plan ==
*SerializeFromObject [input[0, bigint, false] AS value#2L]
+- Scan ExternalRDDScan[obj#1L]

scala> val add1 = df.map { r => r.getLong(0) + 1 }
scala> add1.explain()
== Physical Plan ==
*SerializeFromObject [input[0, bigint, false] AS value#10L]
+- *MapElements <function1>, obj#9: bigint
   +- *DeserializeToObject createexternalrow(value#2L, StructField(value,LongType,false)), obj#8: org.apache.spark.sql.Row
      +- *SerializeFromObject [input[0, bigint, false] AS value#2L]
         +- Scan ExternalRDDScan[obj#1L]

scala> val cachedAdd1 = add1.cache()
scala> cachedAdd1.explain()
== Physical Plan ==
InMemoryTableScan [value#10L]
   +- InMemoryRelation [value#10L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
         +- *SerializeFromObject [input[0, bigint, false] AS value#10L]
            +- *MapElements <function1>, obj#9: bigint
               +- *DeserializeToObject createexternalrow(value#2L, StructField(value,LongType,false)), obj#8: org.apache.spark.sql.Row
                  +- *SerializeFromObject [input[0, bigint, false] AS value#2L]
                     +- Scan ExternalRDDScan[obj#1L]

So, that explained the last two actions in my plan but I was no nearer understanding my problem. However, I did notice that there were lots of User Defined Functions in my DataFrame, presumably from Spark's Transformers that took my raw textual data and helped me turn them into vectors.

Could this be my problem? There appear to have been several problems with UDFs (see SPARK-17728 for example) and no shortage of people warning against them (for example here: "Using a UDF implies deserialization to process the data in classic Scala and then reserialize it. UDFs can be replaced by Spark SQL functions, there are already a lot of them and new ones are regularly added.")

But I could not replicate the issue by testing in the Spark shell:

scala> val df = sc.range(0, 5).toDF
scala> val noisyAdd1UDF = udf{ x: Long => println("noisyAdd1UDF = " + x); x + 1 } // not a pure function
scala> val dfWithUdf = df.withColumn("added", noisyAdd1UDF('value))
scala> dfWithUdf.explain()
== Physical Plan ==
*Project [value#283L, UDF(value#283L) AS added#287L]
+- *SerializeFromObject [input[0, bigint, false] AS value#283L]
   +- Scan ExternalRDDScan[obj#282L]

Now, let's show() it twice. Each time, the UDF is re-calculated:

noisyAdd1UDF = 0
noisyAdd1UDF = 4
|    0|    1|
|    4|    5|

noisyAdd1UDF = 0
noisyAdd1UDF = 2
|    0|    1|
|    4|    5|

So, no surprises. Calling cache() stopped subsequent show()s from calling the UDF again as expected.

The Gotcha

It turns out that the problem is not directly with a UDF but with the DataFrame that is created when we add the UDF. The new DataFrame does not inherit its parent's StorageLevel (note that other operations like randomSplit will do the same thing).

We can see this by doing:

scala> val df = sc.range(0, 100000).toDF
df: org.apache.spark.sql.DataFrame = [value: bigint]

scala> df.storageLevel
res67: org.apache.spark.storage.StorageLevel = StorageLevel(1 replicas)

scala> df.cache()
res68: df.type = [value: bigint]

scala> df.storageLevel
res69: org.apache.spark.storage.StorageLevel = StorageLevel(disk, memory, deserialized, 1 replicas)

scala> val dfWithUdf = df.withColumn("value", noisyAdd1UDF('value))
dfWithUdf: org.apache.spark.sql.DataFrame = [value: bigint]

scala> dfWithUdf.storageLevel
res70: org.apache.spark.storage.StorageLevel = StorageLevel(1 replicas)

Now, while other Spark classifiers might also use withColumn, they discard the other columns that would call the UDF and thus cause the DataFrame to be re-calculated. OneVsRest, however, does not do this. Indeed, it cannot, as it does not know whether those columns will be used by the classifiers it wraps.

And we can see this if we again look at the Query Plans. Our cached DataFrame will use InMemory plans:

scala> df.explain()
== Physical Plan ==
InMemoryTableScan [value#744L]
   +- InMemoryRelation [value#744L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
         +- *SerializeFromObject [input[0, bigint, false] AS value#744L]
            +- Scan ExternalRDDScan[obj#743L]

but although it sits on top of this cached DataFrame, our non-cached DataFrame will have as its top layer a Project that calls our UDF:

scala> dfWithUdf.explain()
== Physical Plan ==
*Project [UDF(value#744L) AS value#753L]
+- InMemoryTableScan [value#744L]
      +- InMemoryRelation [value#744L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
            +- *SerializeFromObject [input[0, bigint, false] AS value#744L]
               +- Scan ExternalRDDScan[obj#743L]

In turn, caching this puts another layer of InMemoryXXX on top:

scala> dfWithUdf.explain()
== Physical Plan ==
InMemoryTableScan [value#753L]
   +- InMemoryRelation [value#753L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
         +- *Project [UDF(value#744L) AS value#753L]
            +- InMemoryTableScan [value#744L]
                  +- InMemoryRelation [value#744L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
                        +- *SerializeFromObject [input[0, bigint, false] AS value#744L]
                           +- Scan ExternalRDDScan[obj#743L]

The solution is to use checkpoint(). This collapses the query plan, freezing the results and obviating the need to recalculate anything.

Wednesday, June 6, 2018

Spark: checkpoint or cache?

Previously, 20k sentences would take a few minutes using Spark's built-in machine learning algorithms. So, I was surprised that 20k larger documents could take far, far longer.

Note: each data-point was reduced to a vector of length 20 irrespective of whether it was a short sentence or a long piece of text. So document size should have been irrelevant.

After over three hours, my One-vs-All, Support Vector Machine still was not finished.

I managed to fix the problem but first, here are the results for the models I used on the larger documents. Again, the figures are indicative as I did not run the models more than once.

Model | Accuracy (%)
LinearSVC and OneVsRest | 74.2

Despite calling .cache() on every DataFrame in sight and everything apparently fitting into memory with gigabytes to spare, performance was dreadful. (Note that DataSet.cache has a default StorageLevel of MEMORY_AND_DISK whereas RDD.cache defaults to MEMORY_ONLY).

Using jstat, it was clear all the executor threads were involved in some sort of serialization with deep stack-frames in

So, instead I tried calling checkpoint on the DataFrame before passing it to the machine learning algorithm. This has the effect of removing all lineage and writing to the HDFS directory you must set in sc.setCheckpointDir.

"Broadly speaking, we advise persisting when jobs are slow and checkpointing when they are failing" [1] say Karau and Warren but since my job was so slow it might as well have failed I had nothing to lose.

Indeed, the SVM was now able to process the data in a few minutes. Quite why caching didn't work as expected is something of a mystery that I will look at in another post.

[1] High Performance Spark

Monday, June 4, 2018

Manifold Destiny


Manifolds are "'spaces' of points, together with a notion of distance that looked like Euclidean distance on small scales but which could be quite different at larger scales." [1]

"Although there is a formal mathematical meaning to the term manifold, in machine learning it tends to be used more loosely to designate a connected set of points that can be approximated well by considering only a small number of degrees of freedom, or dimensions, embedded in a higher-dimensional space... In the context of machine learning, we allow the dimensionality of the manifold to vary from one point to another. This often happens when a manifold intersects itself. For example, a figure eight is a manifold that has a single dimension in most places but two dimensions at the intersection at the center.

"Many machine learning problems seem hopeless if we expect the machine learning algorithm to learn functions with interesting variations across all of ℝn. Manifold learning algorithms surmount this obstacle by assuming that most of ℝn consists of invalid inputs, and that interesting inputs occur only along a collection of manifolds containing a small subset of points, with interesting variations in the output of the learned function occurring only along directions that lie on the manifold, or with interesting variations happening only when we move from one manifold to another. Manifold learning was introduced in the case of continuous-valued data and the unsupervised learning setting, although this probability concentration idea can be generalized to both discrete data and the supervised learning setting: the key assumption remains that probability mass is highly concentrated."

The manifold hypothesis is the observation "that the probability distribution over images, text strings, and sounds that occur in real life is highly concentrated." [2]

Transforming the data

"The most logical way to transform hour is into two variables that swing back and forth out of sink. Imagine the position of the end of the hour hand of a 24-hour clock. The x position swings back and forth out of sink with the y position. For a 24-hour clock you can accomplish this with x=sin(2pi*hour/24),y=cos(2pi*hour/24).

"You need both variables or the proper movement through time is lost. This is due to the fact that the derivative of either sin or cos changes in time where as the (x,y) position varies smoothly as it travels around the unit circle.

"Finally, consider whether it is worthwhile to add a third feature to trace linear time, which can be constructed my hours (or minutes or seconds) from the start of the first record or a Unix time stamp or something similar. These three features then provide proxies for both the cyclic and linear progression of time e.g. you can pull out cyclic phenomenon like sleep cycles in people's movement and also linear growth like population vs. time" (StackExchange).

Normalizing the data

"If the input variables are combined linearly, as in an MLP, then it is rarely strictly necessary to standardize the inputs, at least in theory. The reason is that any rescaling of an input vector can be effectively undone by changing the corresponding weights and biases, leaving you with the exact same outputs as you had before. However, there are a variety of practical reasons why standardizing the inputs can make training faster and reduce the chances of getting stuck in local optima. Also, weight decay and Bayesian estimation can be done more conveniently with standardized inputs" ( FAQ).

Just how normalization and standardization can affect the overall accuracy for different optimizers can be seen on Rinat Maksutov's blog here.

In general, it improves training performance rather than changing the best achievable result. "It is good idea not just to normalize data but also to scale them. This is intended for faster approaching to global minima at error surface" (StackOverflow).

Doing just the opposite, multiplying all elements in all vectors by a random factor between 0.5 and 1.5 made no difference to the accuracy (94.5%) of an ANN on the "20 Newsgroups" data set. By comparison, the choice of optimizer made a huge difference: gradient descent managed only 15.6%, irrespective of whether the vectors were normalized (Spark's default optimizer is l-BFGS).
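For reference, the "centering and scaling" that the FAQ describes is just column-wise standardization; a sketch in numpy (illustrative, not the Spark code I used):

```python
import numpy as np

def standardize(X, eps=1e-12):
    """Give each column zero mean and unit variance."""
    return (X - X.mean(axis=0)) / np.maximum(X.std(axis=0), eps)

# two features on wildly different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])
Z = standardize(X)
# each column of Z now has mean 0 and std 1, whatever the original units
```

After this transform, the choice of metric vs. imperial units (or any other linear rescaling) no longer matters.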

[1] The Princeton Companion to Mathematics
[2] Deep Learning (Goodfellow, Bengio, Courville)

Tuesday, May 29, 2018

Neural Net Architecture

Some more notes on the different types and aspects of neural net architecture - what to use and when.

Multilayer Perceptron

"Common columnar data has a static structure to it, and is best modeled in DL4J by a classic multilayer perceptron neural network. These problems may benefit from minor feature engineering, but we can often let the network find the best weights on the dataset on its own. Hyperparameter tuning is one of the main challenges when modeling with MLPs." [1]


"While we recommend the practitioner start with CNNs for image modeling problems, application of these networks have begun to appear for text modeling problems as well." [1] For example machine translation, sentence classification and sentiment analysis.

Batch Normalization

"Normalization is a broad category of methods that seek to make different samples seen by a machine-learning model more similar to each other, which helps the model learn and generalize well to new data. The most common form of data normalization is ... centering the data on 0 by subtracting the mean from the data, and giving the data a unit standard deviation by dividing the data by its standard deviation. In effect, this makes the assumption that the data follows a normal (or Gaussian) distribution and makes sure this distribution is centered and scaled to unit variance

"data normalization should be a concern after every transformation operated by the network... The main effect of batch normalization is that it helps with gradient propagation—much like residual connections—and thus allows for deeper networks." [2]

Bottleneck Layer

"A bottleneck layer is a layer that contains few nodes compared to the previous layers. It can be used to obtain a representation of the input with reduced dimensionality. An example of this is the use of autoencoders with bottleneck layers for nonlinear dimensionality reduction" (StackOverflow).

Recurrent Neural Networks

"Recurrent Neural Networks are in the family of feed-forward neural-networks. They differ from other feed-forward networks in their ability to send information over time-steps... Modeling the time dimension is a hallmark of Recurrent Neural Networks." [1]

"That is, a feedforward network has no notion of order in time, and the only input it considers is the current example it has been exposed to. Feedforward networks are amnesiacs regarding their recent past; they remember nostalgically only the formative moments of training.
Recurrent networks, on the other hand, take as their input not just the current input example they see, but also what they have perceived previously in time" (DeepLearning4J).

RNNs do this by feeding their hidden state from time step t-1 back in alongside the input at time step t.
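As a sketch, one step of a vanilla RNN might look like this (random weights, illustrative only):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One step of a vanilla RNN: the new hidden state mixes the
    current input with the previous hidden state."""
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

rng = np.random.default_rng(0)
W_x = rng.normal(size=(3, 4))        # input -> hidden
W_h = rng.normal(size=(4, 4))        # hidden -> hidden (the recurrence)
b = np.zeros(4)

h = np.zeros(4)                      # initial state
for x_t in rng.normal(size=(5, 3)):  # a sequence of 5 inputs
    h = rnn_step(x_t, h, W_x, W_h, b)
```

The final h depends on the whole sequence, which is exactly the "memory" that a feedforward network lacks.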


"A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor... This chain-like nature reveals that recurrent neural networks are intimately related to sequences and lists. They’re the natural architecture of neural network to use for such data... Essential to these successes is the use of “LSTMs,” a very special kind of recurrent neural network which works, for many tasks, much much better than the standard version." (Chris Olah's blog).


"When we see data that is the combination of both time and image (eg video) we use a special hybrid network of Long Short-Term Memory (LSTM) and convolutional layers.

"Generally, you want to use multiple epochs and one iteration (.iterations(1) option) when training; multiple iterations are generally only used when doing full-batch training on very small data sets" (DeepLearning4J).

"If the input is a fixed-length sequence (ie all time-series are the same length), you might as well use an  MLP or CNN." [1]

Activation Functions

"As long as the activation function is something nonlinear, a neural network with at least one hidden layer can approximate arbitrary functions... The nonlinear neural network model with a hidden layer ... is flexible enough to approximately represent any function! What a time to be alive!" [3] This is called the Universal Approximation Theorem.

See Chris Olah's blog: "Each layer stretches and squishes space, but it never cuts, breaks, or folds it... Transformations like this, which don’t affect topology, are called homeomorphisms. Formally, they are bijections that are continuous functions both ways... Layers with N inputs and N outputs are homeomorphisms, if the weight matrix, W, is non-singular."

"If you do not put an activation function between two layers, then two layers together will serve no better than one, because their effect will still be just a linear transformation... The reason why people use ReLU between layers is because it is non-saturating (and is also faster to compute). Think about the graph of a sigmoid function. If the absolute value of x is large, then the derivative of the sigmoid function is small, which means that as we propagate the error backwards, the gradient of the error will vanish very quickly as we go back through the layers. With ReLU the derivative is 1 for all positive inputs, so the gradient for those neurons that fired will not be changed by the activation unit at all and will not slow down the gradient descent.

"For the last layer of the network the activation unit also depends on the task. For regression you will want to use the sigmoid or tanh activation, because you want the result to be between 0 and 1. For classification you will want only one of your outputs to be one and all others zeros, but there's no differentiable way to achieve precisely that, so you will want to use a softmax to approximate it" (from StackOverflow).

"You don’t have to worry too much about which activation function is better under what circumstances. That’s still an active research topic... Usually, the best one is chosen by using cross-validation to determine which one gives the best model, given the dataset you’re working with." [3]

[1] Deep Learning: A Practitioner's Approach
[2] Deep Learning with Python
[3] Machine Learning with TensorFlow

Thursday, May 24, 2018

Vector Space, the Final Frontier

Given vectors that represent words, how do we construct sentences? Do we add the vectors? Do we find centroids? Do we normalize before, after or not at all?

In fact, can we even say we are dealing with a vector space?

Remember, a vector space has the following 8 properties:
  • Identity (of addition and scalar multiplication)
  • Distributivity (of scalars and vectors)
  • Commutativity and associativity of addition, plus an additive inverse
  • Compatibility: (ab)v = a(bv)
But as pointed out on StackOverflow:

"The commutativity property of vector addition does not always hold in semantics. Therefore, this property shouldn't (always) hold in the embedding space either. Thus, the embedding space should not be called a vector space.

E.g. attempt to treat semantic composition as vector addition in the vector space:

v_{rescue dog} = v_{rescue} + v_{dog} (a dog which is trained to rescue people)

v_{dog rescue} = v_{dog} + v_{rescue} (the operation of saving a dog)

The phrases "rescue dog" and "dog rescue" mean different things, but in our hypothetical vector space, they would (incorrectly) have the same vector representation (due to commutativity).

Similarly for the associativity property."

At the same time, erroneous assumptions are not necessarily unacceptable (as the post points out). It's just a high-bias model.
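The indistinguishability the post describes takes one line to demonstrate (random stand-in vectors; concatenation shown as one order-sensitive alternative):

```python
import numpy as np

rng = np.random.default_rng(0)
v_rescue, v_dog = rng.normal(size=50), rng.normal(size=50)

# Addition is commutative: "rescue dog" and "dog rescue" collapse
# to the same point.
same = np.allclose(v_rescue + v_dog, v_dog + v_rescue)

# Concatenation, by contrast, preserves word order.
rescue_dog = np.concatenate([v_rescue, v_dog])
dog_rescue = np.concatenate([v_dog, v_rescue])
different = not np.allclose(rescue_dog, dog_rescue)
```

This is the high-bias trade-off in miniature: addition throws away order but keeps the dimensionality fixed; concatenation keeps order but grows with sentence length.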

Different Spaces

For fun, I tried the vectors that Word2Vec gave me. Now, there is no reason I could think of why the vectors this algorithm gives me for words should be used to form a sentence. But the results were surprising.

Description | Accuracy (%)
Raw vector addition | 81.0
Normalized vector addition | 27.9
Raw vector centroids | 7
Raw vector addition then normalizing | 7

[Note: figures are indicative only.]

That is, adding together Word2Vec generated word vectors to make a sentence meant my neural net produced decent (but not stellar) results.

More promising was combining vectors from category space. The results looked like this:

Category Space
Description | Accuracy (%)
Normalized vector addition | 94.0
Normalized vector centroids | 92.1
Adding unnormalized vectors | 6.2
Normalized vector addition then normalizing | 5.3

[Note: figures are indicative only.]

Finally, concatenating the word vectors for a sentence (truncating if there were more than 10 words per text and padding if there were fewer) and feeding the result into an ANN produced an accuracy of 94.2%. Naive Bayes and Random Forest gave similar results (92.3% and 93.3% respectively).
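A sketch of that concatenate-truncate-pad scheme (my own toy implementation, not the code behind the 94.2% figure):

```python
import numpy as np

def concat_sentence(word_vecs, max_words=10, dim=20):
    """Concatenate word vectors, truncating to max_words and zero-padding
    shorter sentences, giving a fixed-length (max_words * dim) input."""
    out = np.zeros((max_words, dim))
    for i, v in enumerate(word_vecs[:max_words]):
        out[i] = v
    return out.ravel()

short = concat_sentence([np.ones(20)] * 3)    # 3 words: zero-padded
long_ = concat_sentence([np.ones(20)] * 15)   # 15 words: truncated to 10
# both come out with shape (200,), ready for a fixed-size input layer
```

Unlike addition or centroids, this keeps word order, at the cost of a hard cap on sentence length.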

Note: multiplying each vector by a factor between 0.5 and 1.5 made no difference to the accuracy of a Spark ANN; the weights simply change accordingly. But "Neural Networks tend to converge faster, and K-Means generally gives better clustering with pre-processed features" (StackOverflow).

However, when I ran the same data on a similar TensorFlow implementation, my accuracy swung wildly between about 50% and 70%. Only when I normalized the data did TF give me an accuracy comparable to Spark's.


I had much better accuracy using category space rather than TF-IDF when classifying documents.

When I asked a Data Science PhD I work with which technique I should use (adding vectors; finding centroids, concatenating vectors etc) his answer was: "Yes".

Wednesday, May 23, 2018

Making the gradient

Gradient Descent is perhaps the simplest means of finding minima. The maths is pretty easy. We are trying to minimize the sum of squared errors of our target value and our algorithm's output. Let J(w) be our cost function which is a function of weights w and is based on the SSE. Then

J(w) = Σ_i (target_i - output_i)² / 2

where i is the index of a sample in our dataset and we divide by 2 for convenience (it will soon disappear).

Say that

f(w) = target - output = y - w^T · x

then using the chain rule, we can calculate:

dJ / dw = (dJ / df) . (df / dw)

if η is our step size, then Δw_j, the value by which we adjust w_j, is

Δw_j ≈ -η (dJ / dw_j)
     = -η Σ_i (target_i - output_i)(-x_ij)
     = η Σ_i (target_i - output_i) x_ij

the negative coming from the fact that we want to move in the opposite direction to the gradient, as we are seeking a minimum. As you can see, it cancels out the leading negative. Note, too, that the divide-by-2 has disappeared: differentiating the square brings down a factor of 2 that cancels it.

Now, the difference between GD and Stochastic Gradient Descent comes in how we implement the algorithm. In GD, we sum the updates over all samples i before adjusting w_j. In SGD, we adjust w_j after each individual sample i and then move on (see Quora).
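Here is a toy implementation of both variants on a one-weight linear problem (illustrative only; η and epoch counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 2.0 * X[:, 0]                  # true weight is 2, no noise

def gd(X, y, eta=0.1, epochs=50):
    """Batch gradient descent: one update per pass over the data."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = -(y - X @ w) @ X    # dJ/dw summed over all samples
        w -= eta * grad / len(y)
    return w

def sgd(X, y, eta=0.1, epochs=50):
    """Stochastic gradient descent: one update per sample."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            w -= eta * -(y_i - x_i @ w) * x_i
    return w

gd(X, y)[0], sgd(X, y)[0]  # both approach the true weight, 2.0
```

Both follow the same gradient formula derived above; they differ only in how many samples contribute to each step.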

Note that in TensorFlow you can achieve SGD by using a mini-batch size of 1 (StackOverflow).

Also note "It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize" (from here).

Gradient Descent vs. Newton's Methods

"Gradient descent maximizes a function using knowledge of its derivative. Newton's method, a root finding algorithm, maximizes a function using knowledge of its second derivative" (from StackOverflow).

"More people should be using Newton's method in machine learning... Of course, Newton's method will not help you with L1 or other similar compressed sensing/sparsity promoting penalty functions, since they lack the required smoothness" (ibid). The post includes a good chat about best way to find minima and how GD compares to Newton's methods.

LBFGS and Spark

Spark's default solver for a MultilayerPerceptronClassifier is l-BFGS (a limited-memory version of BFGS) which, at least for my dataset, is both fast and accurate. The accuracy looks like:

Algorithm | Iterations | Batch Size | Accuracy (%)
gd | 1000 | 1 (ie, SGD) | 88.5

[Note: figures are indicative only.]

For my configuration (10 executors each with 12gb of memory and 3 cores plus a driver with 20gb of memory), l-bfgs was several times faster.

Interestingly, "there used to be l-BFGS implemented in standard TensorFlow but it was deleted because it never got robust enough. Because tensorflow distributed framework is quite low-level so people sometimes get surprised by how much work it takes to get something robust" (GitHub).