Tuesday, May 29, 2018

Neural Net Architecture


Some more notes on the different types and aspects of neural net architecture - what to use and when.

Multilayer Perceptron

"Common columnar data has a static structure to it, and is best modeled in DL4J by a classic multilayer perceptron neural network. These problems may benefit from minor feature engineering, but we can often let the network find the best weights on the dataset on its own. Hyperparameter tuning is one of the main challenges when modeling with MLPs." [1]

CNNs

"While we recommend the practitioner start with CNNs for image modeling problems, application of these networks have begun to appear for text modeling problems as well." [1] For example machine translation, sentence classification and sentiment analysis. "CNNs basically do the feature extraction on their own" (StackOverflow).

Batch Normalization

"Normalization is a broad category of methods that seek to make different samples seen by a machine-learning model more similar to each other, which helps the model learn and generalize well to new data. The most common form of data normalization is ... centering the data on 0 by subtracting the mean from the data, and giving the data a unit standard deviation by dividing the data by its standard deviation. In effect, this makes the assumption that the data follows a normal (or Gaussian) distribution and makes sure this distribution is centered and scaled to unit variance

"data normalization should be a concern after every transformation operated by the network... The main effect of batch normalization is that it helps with gradient propagation—much like residual connections—and thus allows for deeper networks." [2]

Bottleneck Layer

"A bottleneck layer is a layer that contains few nodes compared to the previous layers. It can be used to obtain a representation of the input with reduced dimensionality. An example of this is the use of autoencoders with bottleneck layers for nonlinear dimensionality reduction" (StackOverflow).

Recurrent Neural Networks

"Recurrent Neural Networks are in the family of feed-forward neural-networks. They differ from other feed-forward networks in their ability to send information over time-steps... Modeling the time dimension is a hallmark of Recurrent Neural Networks." [1]

"That is, a feedforward network has no notion of order in time, and the only input it considers is the current example it has been exposed to. Feedforward networks are amnesiacs regarding their recent past; they remember nostalgically only the formative moments of training.
Recurrent networks, on the other hand, take as their input not just the current input example they see, but also what they have perceived previously in time" (DeepLearning4J).

RNNs do this by feeding their output (hidden state) from time step t-1 back in, alongside the input, at time step t.
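A minimal numpy sketch of one vanilla RNN step (illustrative, not DL4J code): the hidden state carried over from step t-1 is combined with the input at step t.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN step: mix the current input with the previous hidden state."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

rng = np.random.default_rng(0)
W_xh, W_hh, b_h = rng.normal(size=(3, 5)), rng.normal(size=(5, 5)), np.zeros(5)

h = np.zeros(5)                            # initial hidden state
for x_t in rng.normal(size=(4, 3)):        # a sequence of 4 time steps, 3 features each
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)  # h now summarises everything seen so far
```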

LSTMs

"A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor... This chain-like nature reveals that recurrent neural networks are intimately related to sequences and lists. They’re the natural architecture of neural network to use for such data... Essential to these successes is the use of “LSTMs,” a very special kind of recurrent neural network which works, for many tasks, much much better than the standard version." (Chris Olah's blog).

Hybrid

"When we see data that is the combination of both time and image (eg video) we use a special hybrid network of Long Short-Term Memory (LSTM) and convolutional layers.

"Generally, you want to use multiple epochs and one iteration (.iterations(1) option) when training; multiple iterations are generally only used when doing full-batch training on very small data sets" (DeepLearning4J).

"If the input is a fixed-length sequence (ie all time-series are the same length), you might as well use an  MLP or CNN." [1]

Activation Functions

"As long as the activation function is something nonlinear, a neural network with at least one hidden layer can approximate arbitrary functions... The nonlinear neural network model with a hidden layer ... is flexible enough to approximately represent any function! What a time to be alive!" [3] This is called the Universal Approximation Theorem.

See Chris Olah's blog: "Each layer stretches and squishes space, but it never cuts, breaks, or folds it... Transformations like this, which don’t affect topology, are called homeomorphisms. Formally, they are bijections that are continuous functions both ways... Layers with N inputs and N outputs are homeomorphisms, if the weight matrix, W, is non-singular."

"If you do not put an activation function between two layers, then two layers together will serve no better than one, because their effect will still be just a linear transformation... The reason why people use ReLU between layers is because it is non-saturating (and is also faster to compute). Think about the graph of a sigmoid function. If the absolute value of x is large, then the derivative of the sigmoid function is small, which means that as we propagate the error backwards, the gradient of the error will vanish very quickly as we go back through the layers. With ReLU the derivative is 1 for all positive inputs, so the gradient for those neurons that fired will not be changed by the activation unit at all and will not slow down the gradient descent.

"For the last layer of the network the activation unit also depends on the task. For regression you will want to use the sigmoid or tanh activation, because you want the result to be between 0 and 1. For classification you will want only one of your outputs to be one and all others zeros, but there's no differentiable way to achieve precisely that, so you will want to use a softmax to approximate it" (from StackOverflow).

"You don’t have to worry too much about which activation function is better under what circumstances. That’s still an active research topic... Usually, the best one is chosen by using cross-validation to determine which one gives the best model, given the dataset you’re working with." [3]

[1] Deep Learning: A Practitioner's Approach
[2] Deep Learning with Python
[3] Machine Learning with TensorFlow

Thursday, May 24, 2018

Vector Space, the Final Frontier


Given vectors that represent words, how do we construct sentences? Do we add the vectors? Do we find centroids? Do we normalize before, after or not at all?

In fact, can we even say we are dealing with a vector space?

Remember, a vector space satisfies the following axioms (eight in all):
  • Identity elements (for addition and scalar multiplication)
  • Distributivity (of scalar multiplication over vector addition and over scalar addition)
  • Commutativity and associativity of addition, plus an additive inverse
  • Compatibility of scalar multiplication: (ab)v = a(bv)
But as pointed out on StackOverflow:

"The commutativity property of vector addition does not always hold in semantics. Therefore, this property shouldn't (always) hold in the embedding space either. Thus, the embedding space should not be called a vector space.

E.g. attempt to treat semantic composition as vector addition in the vector space:

v(rescue dog) = v(rescue) + v(dog) (a dog which is trained to rescue people)

v(dog rescue) = v(dog) + v(rescue) (the operation of saving a dog)

The phrases "rescue dog" and "dog rescue" mean different things, but in our hypothetical vector space, they would (incorrectly) have the same vector representation (due to commutativity).

Similarly for the associativity property."

At the same time, erroneous assumptions are not necessarily unacceptable (as the post points out). It's just a high-bias model.

Different Spaces

For fun, I tried the vectors that Word2Vec gave me. There was no reason I could think of why the vectors this algorithm gives me for individual words should combine to form a sentence, but the results were surprising.

Word2Vec
Description                               Accuracy (%)
Raw vector addition                       81.0
Normalized vector addition                27.9
Raw vector centroids                      7
Raw vector addition then normalizing      7

[Note: figures are indicative only.]

That is, adding together Word2Vec generated word vectors to make a sentence meant my neural net produced decent (but not stellar) results.
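A numpy sketch (with made-up data) of the combinations in the table above: raw addition, addition of normalized word vectors, and centroids.

```python
import numpy as np

def sentence_vector(word_vecs, how="sum", normalize_words=False):
    """Combine per-word vectors into a single sentence vector."""
    vecs = np.array(word_vecs, dtype=float)
    if normalize_words:
        vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs.mean(axis=0) if how == "centroid" else vecs.sum(axis=0)

words = [np.random.rand(100) for _ in range(7)]           # 7 words, 100-d vectors (sizes assumed)
raw_sum  = sentence_vector(words)                          # raw vector addition
norm_sum = sentence_vector(words, normalize_words=True)    # normalized vector addition
centroid = sentence_vector(words, how="centroid")          # vector centroid
```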

More promising was combining vectors from category space. The results looked like this:

Category Space
Description                                     Accuracy (%)
Normalized vector addition                      94.0
Normalized vector centroids                     92.1
Adding unnormalized vectors                     6.2
Normalized vector addition then normalizing     5.3

[Note: figures are indicative only.]

Finally, concatenating the word vectors for a sentence (truncating if there were more than 10 words per text and padding if there were fewer) and feeding the result into an ANN produced an accuracy of 94.2%. Naive Bayes and Random Forest gave similar results (92.3% and 93.3% respectively).
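The fixed-length concatenation described above could look something like this sketch (the 10-word limit matches the post; the 20-element word vectors are an assumption).

```python
import numpy as np

def concat_fixed(word_vecs, max_words=10, dim=20):
    """Truncate to max_words, pad shorter texts with zero vectors, then flatten."""
    vecs = [np.asarray(v, dtype=float) for v in word_vecs][:max_words]
    vecs += [np.zeros(dim)] * (max_words - len(vecs))
    return np.concatenate(vecs)            # shape: (max_words * dim,)
```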

Note: multiplying each vector by a factor between 0.5 and 1.5 made no difference to the accuracy of a Spark ANN; the weights simply change accordingly. But "Neural Networks tend to converge faster, and K-Means generally gives better clustering with pre-processed features" (StackOverflow).

However, when I ran the same data on a similar TensorFlow implementation, my accuracy swung wildly between about 50% and 70%. Only when I normalized the data did TF give me an accuracy comparable to Spark's.

Conclusion

I had much better accuracy using category space rather than TF-IDF when classifying documents.

When I asked a Data Science PhD I work with which technique I should use (adding vectors, finding centroids, concatenating vectors, etc.), his answer was: "Yes".


Wednesday, May 23, 2018

Making the gradient


Gradient Descent is perhaps the simplest means of finding minima. The maths is pretty easy. We are trying to minimize the sum of squared errors (SSE) between our target values and our algorithm's outputs. Let J(w) be our cost function, which is a function of the weights w and is based on the SSE. Then

J(w) = Σ_i (target_i - output_i)² / 2

where i is the index of a sample in our dataset and we divide by 2 for convenience (it will soon disappear).

Say that

f(w) = target - output = y - wᵀ·x

then using the chain rule, we can calculate:

dJ/dw = (dJ/df) · (df/dw)

If η is our step size, then Δw_j, the value by which we adjust w_j, is

Δw_j ≈ -η (∂J/∂w_j)
     = -η Σ_i (target_i - output_i)(-x_ij)

The leading negative comes from the fact that we want to move in the opposite direction of the gradient, since we are looking for a minimum. As you can see, it cancels out the other negative; note, too, that the divide-by-2 has also disappeared (it was cancelled by the 2 brought down when differentiating the square).

Now, the difference between GD and Stochastic Gradient Descent comes in how we implement the algorithm. In GD, we sum the update to w_j over all samples i before taking a step. In SGD, we update w_j after each individual sample i and then move on (see Quora).
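A minimal numpy sketch of both update rules for a linear model output = wᵀ·x, with synthetic data (everything here is illustrative; dividing by the sample count in the GD step is just the usual averaging convention).

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w                              # targets for a noiseless linear model

eta = 0.1                                   # step size (learning rate)

# Gradient descent: accumulate the error over ALL samples before each step.
w_gd = np.zeros(3)
for _ in range(200):
    error = y - X @ w_gd                    # target_i - output_i for every i
    w_gd += eta * X.T @ error / len(X)      # delta w_j = eta * mean_i error_i * x_ij

# Stochastic gradient descent: update after EACH sample, then move on.
w_sgd = np.zeros(3)
for _ in range(20):                         # epochs
    for x_i, y_i in zip(X, y):
        w_sgd += eta * (y_i - x_i @ w_sgd) * x_i
```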

Note that in TensorFlow you can achieve SGD by using a mini-batch size of 1 (StackOverflow).

Also note "It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize" (from here).


Gradient Descent vs. Newton's Methods

"Gradient descent maximizes a function using knowledge of its derivative. Newton's method, a root finding algorithm, maximizes a function using knowledge of its second derivative" (from StackOverflow).

"More people should be using Newton's method in machine learning... Of course, Newton's method will not help you with L1 or other similar compressed sensing/sparsity promoting penalty functions, since they lack the required smoothness" (ibid). The post includes a good chat about best way to find minima and how GD compares to Newton's methods.


LBFGS and Spark

Spark's default solver for a MultilayerPerceptronClassifier is l-bfgs (a memory-efficient version of BFGS), which, at least for my dataset, is both fast and accurate. The accuracy looks like this:

Algorithm   Iterations   Batch Size      Accuracy (%)
l-bfgs      1000         64              94.4
l-bfgs      100          32              94.4
l-bfgs      100          128             94.0
gd          2000         128             92.4
gd          1000         128             90.0
gd          1000         1 (ie, SGD)     88.5
gd          100          128             15.6

[Note: figures are indicative only.]

For my configuration (10 executors, each with 12GB of memory and 3 cores, plus a driver with 20GB of memory), l-bfgs was several times faster.
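A hedged PySpark sketch of the comparison above; `train` is assumed to be a DataFrame with "features" and "label" columns, and the layer sizes are illustrative.

```python
from pyspark.ml.classification import MultilayerPerceptronClassifier

layers = [400, 100, 20]    # input size, hidden units, number of classes (assumed)

mlp_lbfgs = MultilayerPerceptronClassifier(layers=layers, solver="l-bfgs",
                                           maxIter=100, blockSize=32, seed=42)
mlp_gd = MultilayerPerceptronClassifier(layers=layers, solver="gd",
                                        maxIter=2000, blockSize=128, seed=42)

# model = mlp_lbfgs.fit(train)   # swap in mlp_gd to compare the two solvers
```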

Interestingly, "there used to be l-BFGS implemented in standard TensorFlow but it was deleted because it never got robust enough. Because tensorflow distributed framework is quite low-level so people sometimes get surprised by how much work it takes to get something robust" (GitHub).

Monday, May 7, 2018

Data Wrangling in Textual Classification


You get out what you put in

Since all my machine learning models were giving me similar but modest accuracy, I looked at playing with the data that is fed into them. This made an enormous difference to the accuracy achieved.

Just a reminder of my results: the top five models were giving an accuracy of about 81 ± 5%. They all used TF-IDF to create a DTM (document-term matrix) where "the entire document is thus a feature vector, and each feature vector corresponds to a point in a vector space" (StackOverflow).
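For reference, a hedged PySpark sketch of such a TF-IDF document-term matrix; `docs` is assumed to be a DataFrame with a "text" column.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="rawFeatures")
idf = IDF(inputCol="rawFeatures", outputCol="features")

pipeline = Pipeline(stages=[tokenizer, tf, idf])
# dtm = pipeline.fit(docs).transform(docs)   # one TF-IDF feature vector per document
```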

There was very little improvement using bagging (although, admittedly, I am told that as a rule of thumb one should use as many different models in bagging as one has classes and this I did not do).

The great leap forward came in representing the training data instead as a term-class matrix. In each case, we used a bag-of-words approach on the corpus ("bag-of-words refers to what kind of information you can extract from a document, namely unigram words" - ibid).

Word2Vec

First, I tried Spark's Word2Vec implementation, a word-embedding technique: "a means of building a low-dimensional vector representation from corpus of text, which preserves the contextual similarity of words" (Quora).

[If you read much about Word2Vec, you will see the expression distributed vector used frequently. "In distributed representations of words and paragraphs, the information about the word is distributed all along the vector. So, every position in the vector may be non-zero for a given word.  Contrast this with the one-hot encoding of words, where the representation of a word is all 0s except for a 1 in one position for that word" (Quora).]
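A hedged sketch of Spark's Word2Vec (the vector size, minimum count and column names are assumptions).

```python
from pyspark.ml.feature import Word2Vec

word2vec = Word2Vec(vectorSize=100, minCount=5,
                    inputCol="words", outputCol="docVector")

# model = word2vec.fit(docs)
# model.getVectors()        # one embedding per vocabulary word
# model.transform(docs)     # per-document average of its word vectors
```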

Vered Shwartz says: "It was common to apply dimensionality reduction to the word co-occurrence matrix (count based vectors) to get low-dimensional word vectors. They perform similarly or slightly worse than word embeddings (it depends who you ask...)". She then cites two different papers that found radically different results.

I tried using Word2Vec embeddings on a Spark neural net for the "20 Newsgroups" data but only got the monkey score (5.02%), along with lots of "StrongWolfeLineSearch: Encountered bad values in function evaluation. Decreasing step size to ..." WARNings originating in the optimize.LBFGS code. At first I did not know why it was complaining. (UPDATE: I was normalizing the resulting sentence vector even though the word vectors had already been normalized.)

Interestingly, Random Forest gave me 82.6% on exactly the same data. Nice, but not a massive improvement, so I concluded that using Word2Vec did not significantly affect the results.

Class-Term Matrix

Using a matrix where the rows are the classes rather than the documents, and the columns are still word vectors (representing the probability of a word appearing in a given class), resulted in much better predictions. Spark's neural nets, random forest and naive Bayes all achieved an accuracy of about 93 ± 0.5%.

Note, the words were counted manually without using Spark's IDF class. Because of a lack of collisions using this technique, about 200 more words were captured (about 2% extra), but this did not seem to make any difference. The improvement appears to be entirely due to adding the word vectors (with only 20 elements) together for each sentence in the vector space model.

Also note that the columns (ie, the terms) were all normalized (rather than scaled or standardized; see here for the differences). That is, they represented the (frequentist's) probability of that term appearing in any given class. If these vectors were not normalized (ie, they remained the raw counts of that term in each class), then the accuracy of the neural net dropped back down to about 81%.
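A small numpy sketch (mine) of that representation: count each term per class, normalize each term's counts into probabilities, and represent a document as the sum of its words' 20-element class-probability vectors.

```python
import numpy as np

def term_class_probabilities(counts):
    """counts: (num_terms, num_classes) raw counts of each term per class."""
    return counts / counts.sum(axis=1, keepdims=True)   # P(class | term); rows sum to 1

def document_vector(term_indices, tcp):
    """Sum the class-probability vectors of the terms in the document."""
    return tcp[term_indices].sum(axis=0)                 # a 20-element feature vector

counts = np.random.randint(0, 50, size=(5000, 20)) + 1   # toy counts: 5000 terms, 20 classes
tcp = term_class_probabilities(counts)
doc = document_vector([3, 17, 42, 101], tcp)             # hypothetical term indices
```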

Finally, using this representation of the feature vectors led to the KMeans algorithm giving more sensible results (an average accuracy per cluster of 86.4%).

Conclusion

It was explained to me (by a machine learning PhD) that the improvement in accuracy that comes with a term-class matrix is a result of using an input model that better approximates what you want as an output. "TF-IDF combines the importance-per-document and the importance-per-corpus introducing other information not necessarily supporting your classification problem directly," he said. "TF-IDF was developed for Information Retrieval."


Tuesday, May 1, 2018

Wood for the trees


I'd heard Random Forests generally give good results but the accuracy I was seeing was bad. After some head scratching, things improved enormously.

What are Random Forests?

"A random forest is simply a collection of decision trees whose results are aggregated into one final result. Their ability to limit overfitting without substantially increasing error due to bias is why they are such powerful models" (from TowardsDataScience).

Spark's RandomForestClassifier

The matrix for the 20 Newsgroups data was generated using TF-IDF. Interestingly, if all data is used, the accuracy from Spark's RF implementation is the monkey score (roughly 5%).

However, if we insist that a word must appear in at least 20 documents, the accuracy was approximately 25%.
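In Spark, that document-frequency cut-off can be applied when fitting the IDF stage, along the lines of this sketch (column names are assumptions).

```python
from pyspark.ml.feature import IDF

# Ignore terms that appear in fewer than 20 documents when computing IDF.
idf = IDF(inputCol="rawFeatures", outputCol="features", minDocFreq=20)
# idf_model = idf.fit(tf_features)
```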

With minDocFreq=1, changing the Random Forest hyperparameters achieved only a limited increase in accuracy: setting the number of trees to [10, 100, 100] gave accuracies of [10.3, 27.3, 38.3]%.

Removing columns with no non-zero values in them (as outlined here) managed to increase the accuracy to between 42% and 48%. But no matter what hyperparameter tuning I did, there it stayed.

What using minDocFreq=25 and removing empty columns had in common was that both resulted in TF-IDF matrices that were less sparse.

"There is no [Random Forest] implementation for sparse data in R. Partially because RF does not fit very well on this type of problem -- bagging and suboptimal selection of splits may waste most of the model insight on zero-only areas.  Try some kernel method or better think of converting your data into some more lush representation with some descriptors (or use some dimensionality reduction method)." (from StackOverflow).

(See On the Surprising Behaviour of Distance Metrics in High Dimensional Space for more information).

So, let's go all the way and use Singular Value Decomposition (SVD). This reduced my TF-IDF matrix from about 10,000 features to a mere 400. This figure of 400 was found by a manual binary search that attempted to maximize the accuracy. Interestingly, the eigenvalue "elbow" can be seen in this scree plot:



After this (and some more cross-validation, which said my tree size should be about 190), accuracy reached about 81%.
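A hedged PySpark sketch of that pipeline: project the TF-IDF vectors onto the top 400 singular directions, then fit a Random Forest with about 190 trees (reading the post's "tree size" as the number of trees; `tfidf_rows`, the conversion back to a DataFrame, and the memory setting are all assumptions).

```python
from pyspark.mllib.linalg.distributed import RowMatrix
from pyspark.ml.classification import RandomForestClassifier

# `tfidf_rows` is assumed to be an RDD of mllib Vectors, one per document.
mat = RowMatrix(tfidf_rows)
svd = mat.computeSVD(400, computeU=False)
reduced = mat.multiply(svd.V)          # A.V = U.S: documents in the 400-d space

rf = RandomForestClassifier(numTrees=190, labelCol="label", featuresCol="features",
                            maxMemoryInMB=2048)   # see the memory note below
# model = rf.fit(reduced_df)           # after turning `reduced` back into a DataFrame
```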

Note, you may need to set setMaxMemoryInMB to something quite high, otherwise executors will die for no apparent reason. In fact, you'll see nothing special when you call jstat -gc on the executors.

A Matter of Interpretation

One very nice thing about tree models is that they're easy to interpret compared to, say, neural nets.

Spark gives a very nice method (featureImportances) that tells you which features were the most important when generating the model.
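For a fitted RandomForestClassificationModel, that looks roughly like this sketch.

```python
# `model` is a fitted RandomForestClassificationModel (assumed).
importances = model.featureImportances                # per-feature weights summing to 1
top10 = importances.toArray().argsort()[::-1][:10]    # indices of the ten most important features
print(top10)
```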

Gradient Boosted Trees

Note that Spark's facility for boosting trees is currently limited. In Spark, "GBTs do not yet support multiclass classification. For multiclass problems, please use decision trees or Random Forests." (from the docs).