Monday, June 18, 2018

To Normalize or Not?


I've demonstrated in previous posts that pre-processing our data improved our accuracy more than hyperparameter tuning. But "preprocessing in machine learning is somewhat a very black art" as this StackExchange post says. "It is not written down in papers a lot why several preprocessing steps are essential to make it work... To make things more complicated, it depends heavily on the method you use and also on the problem domain."

Clearly, I'm not the only person who is confused, as this StackOverflow question shows.

All the answers on that page are enlightening: given four data points for people's ages and heights, the clustering differs depending on whether you measure in metric or imperial units, since the distances between the points change; normalizing tends to make clusters in K-Means more distinct; and "if you have a neural network and just apply an affine transformation [x' = ax + b] to your data, the network does not lose or gain anything in theory. In practice, however, a neural network works best if the inputs are centered and white." There is a caveat if you use regularization, since rescaling the inputs rescales the learned weights and therefore changes the effective penalty.
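To pin down the terms used throughout this post: L1 normalization scales a vector so the absolute values of its components sum to 1, L2 normalization scales it to unit Euclidean length, and standardization shifts and scales each feature (column) to zero mean and unit variance. A minimal NumPy sketch, with numbers made up purely for illustration:

```python
import numpy as np

# Two made-up feature vectors (rows) whose features live on very different
# scales, e.g. age in years and height in millimetres.
X = np.array([[25.0, 1800.0],
              [40.0, 1650.0]])

# L1 normalization: each row's absolute values sum to 1.
l1 = X / np.abs(X).sum(axis=1, keepdims=True)

# L2 normalization: each row has unit Euclidean length.
l2 = X / np.linalg.norm(X, axis=1, keepdims=True)

# Standardization: each column has zero mean and unit variance.
standardized = (X - X.mean(axis=0)) / X.std(axis=0)

print(l1)
print(l2)
print(standardized)
```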

I looked into this last point and did some (non-exhaustive) experiments. Results are from Spark's MultilayerPerceptronClassifier using 100 epochs. As ever, results are indicative and not to be taken too seriously (for instance, a difference of a percent may just be noise).
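For concreteness, this is roughly the shape of the Spark code involved; the dataset and layer sizes below are placeholders, not the real experiment:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("normalization-experiment").getOrCreate()

# Tiny made-up dataset standing in for the real sentence vectors.
train = spark.createDataFrame([
    (Vectors.dense([0.0, 0.0]), 0.0),
    (Vectors.dense([0.0, 1.0]), 1.0),
    (Vectors.dense([1.0, 0.0]), 1.0),
    (Vectors.dense([1.0, 1.0]), 0.0),
], ["features", "label"])

# Layer sizes: the input must match the feature vector length and the output
# the number of classes. maxIter=100 stands in for the 100 epochs in the text.
mlp = MultilayerPerceptronClassifier(maxIter=100, layers=[2, 5, 2], seed=1234)
model = mlp.fit(train)

predictions = model.transform(train)
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
print(evaluator.evaluate(predictions))
```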

Word Vector    Sentence Vector    Feature Vector    Accuracy (%)
L1             None               Standardized      95
L1             L2                 Standardized      95
L2             L2                 Standardized      94
L2             None               None              94
L1             None               None              94
L1             L2                 None              94
None           L2                 Standardized      84
None           None               None              81
None           None               Standardized      81
None           L2                 None              78
None           L1                 None              77
None           L2                 L1                 8
L1             None               L1                 7
None           None               L1                 6
Standardized   None               None               6
L2             L2                 L1                 5
L2             None               L1                 3

All sentence vectors were constructed by simply adding the relevant word vectors together.
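That is, given a lookup from word to vector, the sentence vector is just the element-wise sum. A sketch of that construction, with a hypothetical word_vectors table:

```python
import numpy as np

# Hypothetical lookup from word to its vector (normalized or not).
word_vectors = {
    "the": np.array([0.1, -0.3,  0.2]),
    "cat": np.array([0.7,  0.1,  0.4]),
    "sat": np.array([0.0,  0.5, -0.2]),
}

def sentence_vector(sentence, dim=3):
    # Sum the vectors of the words we know; unknown words contribute nothing.
    vec = np.zeros(dim)
    for word in sentence.lower().split():
        if word in word_vectors:
            vec += word_vectors[word]
    return vec

print(sentence_vector("the cat sat"))
```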

Although I spent less time on TensorFlow, I noticed that pre-processing was important there too. Using the same architecture, unnormalized data gave me about 50% accuracy while L2-normalized word vectors gave me 93%. Neither accuracy appeared to be improving after the 1000 epochs I let them run for.

Changing tack completely, I also one-hot encoded my words to form sentence vectors and achieved a respectable but not stellar accuracy of 84%.
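Since the sentence vectors are sums of word vectors, summing one-hot words yields a bag-of-words count vector over the vocabulary. A sketch of that reading (the vocabulary and sentence are made up):

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot_sentence(sentence):
    # Each word contributes a one-hot vector; summing them gives word counts.
    vec = np.zeros(len(vocab))
    for word in sentence.lower().split():
        if word in index:
            vec[index[word]] += 1.0
    return vec

print(one_hot_sentence("the cat sat on the mat"))  # [2. 1. 1. 1. 1.]
```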

I thought that perhaps pre-processing was merely helping the neural net along and that, given enough time, it would learn just as well anyway. This appears not to be true. Taking the data with no processing of the word, sentence, and feature vectors, accuracy as a function of the number of epochs looked like this:

Number of epochs    Accuracy (%)
100                 80.9
200                 82.6
1000                84.7
2000                85.3
5000                85.3

This demonstrates a clear plateau in accuracy at about 85%, even though we know we can achieve as much as 95% using the same classifier and the same data (just represented somewhat differently).

