Agile Java Man: To Normalize or Not?

Monday, June 18, 2018

To Normalize or Not?

I've demonstrated in previous posts that pre-processing our data improved our accuracy more than hyperparameter tuning. But "preprocessing in machine learning is somewhat a very black art" as this StackExchange post says. "It is not written down in papers a lot why several preprocessing steps are essential to make it work... To make things more complicated, it depends heavily on the method you use and also on the problem domain."

Clearly, I'm not the only person who is confused, as this StackOverflow question shows.

All the answers on this page are enlightening. Given four points for the age and height of people, the clustering differs depending on whether we use metric or imperial measurements, since the distances between the points will be different. Normalizing tends to make clusters in K-Means more distinct. And "if you have a neural network and just apply an affine transformation [x' = ax + b] to your data, the network does not lose or gain anything in theory. In practice, however, a neural network works best if the inputs are centered and white." There is a caveat for regularization.
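The point about units can be sketched in a few lines. This is only an illustration: the four people and their ages and heights are made up, and "closest neighbour" stands in for a full clustering run. It shows the nearest neighbour of one person changing when height switches from centimetres to feet, and agreeing again once each column is standardized.

```python
import math
import statistics

# Hypothetical people as (age in years, height in cm); the values are made up.
people_cm = {
    "A": (20, 160.0),
    "B": (25, 162.0),
    "C": (21, 195.0),
    "D": (40, 180.0),
}

# The same people with height converted to feet (1 ft = 30.48 cm).
people_ft = {name: (age, cm / 30.48) for name, (age, cm) in people_cm.items()}


def nearest(name, points):
    """Return the name of the point closest to `name` by Euclidean distance."""
    origin = points[name]
    others = (other for other in points if other != name)
    return min(others, key=lambda other: math.dist(origin, points[other]))


def standardize(points):
    """Z-score each column so every feature has mean 0 and unit variance."""
    names = list(points)
    columns = list(zip(*(points[n] for n in names)))
    scaled = [
        [(x - statistics.mean(col)) / statistics.pstdev(col) for x in col]
        for col in columns
    ]
    return {n: tuple(col[i] for col in scaled) for i, n in enumerate(names)}


print(nearest("A", people_cm))               # B: height differences dominate
print(nearest("A", people_ft))               # C: age differences dominate
print(nearest("A", standardize(people_cm)))  # B
print(nearest("A", standardize(people_ft)))  # B: units no longer matter
```

Standardizing is just one affine transformation per column, which is why it removes the dependence on units here.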

I looked into this last point and did some (non-exhaustive) experiments. Results are from Spark's MultilayerPerceptronClassifier using 100 epochs. As ever, results are indicative and not to be taken too seriously (for instance, a difference of a percent may just be noise).

Word Vector	Sentence Vector	Feature Vector	Accuracy (%)
L1	None	Standardized	95
L1	L2	Standardized	95
L2	L2	Standardized	94
L2	None	None	94
L1	None	None	94
L1	L2	None	94
None	L2	Standardized	84
None	None	None	81
None	None	Standardized	81
None	L2	None	78
None	L1	None	77
None	L2	L1	8
L1	None	L1	7
None	None	L1	6
Standardized	None	None	6
L2	L2	L1	5
L2	None	L1	3

All sentence vectors were constructed from simply adding the relevant word vectors together.
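For reference, here is a minimal sketch of what the three treatments in the table might look like in plain Python. It is not the Spark code behind the results: the word vectors are invented, all function names are hypothetical, and "Standardized" is read as a per-column z-score, which is an assumption since the post does not say whether the data was centered as well as scaled.

```python
import math


def l1_normalize(v):
    """Scale a vector so its absolute values sum to 1."""
    norm = sum(abs(x) for x in v)
    return [x / norm for x in v] if norm else list(v)


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def sentence_vector(word_vectors):
    """Add the word vectors together, element by element."""
    return [sum(column) for column in zip(*word_vectors)]


def standardize_columns(rows):
    """Z-score each feature (column) across all rows; constant columns become 0."""
    columns = list(zip(*rows))
    means = [sum(c) / len(c) for c in columns]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in c) / len(c))
        for c, m in zip(columns, means)
    ]
    return [
        [(x - m) / s if s else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Made-up word vectors and sentences, purely for illustration.
word_vectors = {
    "good": [0.2, 0.6, -0.2],
    "film": [1.0, -3.0, 1.0],
    "bad": [-0.5, 0.5, 1.0],
}
sentences = [["good", "film"], ["bad", "film"]]

# The "L1 / L2 / Standardized" row: L1 on words, L2 on sentences, z-score on columns.
features = []
for words in sentences:
    normalized_words = [l1_normalize(word_vectors[w]) for w in words]
    features.append(l2_normalize(sentence_vector(normalized_words)))
features = standardize_columns(features)
print(features)
```

Swapping or dropping any of the three steps gives the other rows of the table.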

Although I spent less time on TensorFlow, I noticed that pre-processing was important there too. Using the same architecture, unnormalized data gave me about 50% accuracy while L2-normalized word vectors gave me 93% accuracy. Neither accuracy seemed to be increasing after the 1000 epochs I let them run for.

Changing tack completely, I also one-hot encoded my words to form sentence vectors and achieved a respectable but not stellar accuracy of 84%.
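As a rough sketch of the one-hot approach, assuming the sentence vector is again the sum of its word vectors (the post does not spell this out for the one-hot case), the result is effectively a word-count vector over the vocabulary. The function names and example sentences below are hypothetical.

```python
def build_vocabulary(sentences):
    """Assign each distinct word an index in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def one_hot(word, vocab):
    """Return a vector of zeros with a single 1 at the word's index."""
    vector = [0] * len(vocab)
    vector[vocab[word]] = 1
    return vector


def one_hot_sentence_vector(sentence, vocab):
    """Sum the one-hot vectors of every word in the sentence."""
    total = [0] * len(vocab)
    for word in sentence.lower().split():
        for i, x in enumerate(one_hot(word, vocab)):
            total[i] += x
    return total


# Made-up corpus, purely for illustration.
vocab = build_vocabulary(["the cat sat", "the dog sat down"])
print(one_hot_sentence_vector("the dog the", vocab))  # [2, 0, 0, 1, 0]
```

Unlike the word-vector rows in the table, these vectors grow with the vocabulary and carry no notion of similarity between words.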

I thought that perhaps I was helping the neural net by pre-processing but it would still learn anyway if given enough time. This appears not to be true. Taking the data with no processing of the word, sentence and column vectors, the accuracy with increasing number of epochs looked like this:

Number of epochs	Accuracy (%)
100	80.9
200	82.6
1000	84.7
2000	85.3
5000	85.3

This demonstrates a clear plateau in accuracy at about 85% when we know we can achieve an accuracy as high as 95% using the same classifier and the same data (if rendered somewhat differently).
