I've demonstrated in previous posts that pre-processing our data improved our accuracy more than hyperparameter tuning. But "preprocessing in machine learning is somewhat a very black art" as this StackExchange post says. "It is not written down in papers a lot why several preprocessing steps are essential to make it work... To make things more complicated, it depends heavily on the method you use and also on the problem domain."
All the answers on this page are enlightening. Given four data points for the age and height of people, the clustering differs depending on whether we use metric or imperial measurements, since the distances between the points differ. Normalizing tends to make clusters in K-Means more distinct. And "if you have a neural network and just apply an affine transformation [x' = ax + b] to your data, the network does not lose or gain anything in theory. In practice, however, a neural network works best if the inputs are centered and white." The answers also note a caveat where regularization is involved.
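To make the point about units concrete, here is a minimal sketch in plain Python. The four (age, height) points are invented for illustration, and I've exaggerated the unit change (metres versus millimetres rather than metric versus imperial) so the effect is obvious. The `nearest` and `standardize` helpers are my own names, not anything from the StackExchange answers.

```python
import math
import statistics

# Hypothetical (age in years, height in metres) for four people.
people = [(20, 1.60), (22, 1.90), (40, 1.62), (42, 1.88)]

# The same people with height expressed in millimetres.
in_mm = [(age, height * 1000) for age, height in people]


def nearest(points, i):
    """Index of the point closest to points[i] by Euclidean distance."""
    others = [j for j in range(len(points)) if j != i]
    return min(others, key=lambda j: math.dist(points[i], points[j]))


def standardize(points):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*points))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) for c in columns]
    return [
        tuple((x - m) / s for x, m, s in zip(p, means, stds))
        for p in points
    ]


print(nearest(people, 0))  # 1: in metres, age dominates the distance
print(nearest(in_mm, 0))   # 2: in millimetres, height dominates

# After standardizing, the choice of unit no longer changes the neighbour.
print(nearest(standardize(people), 0) == nearest(standardize(in_mm), 0))
```

Any distance-based method (K-Means included) inherits this sensitivity, which is one reason to standardize the columns before clustering.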
I looked into this last point and did some (non-exhaustive) experiments. Results are from Spark's MultilayerPerceptronClassifier using 100 epochs. As ever, results are indicative and not to be taken too seriously (for instance, a difference of a percent may just be noise).
| Word Vector | Sentence Vector | Feature Vector | Accuracy (%) |
All sentence vectors were constructed from simply adding the relevant word vectors together.
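As a rough illustration of that construction, the sketch below sums word vectors element-wise. The tiny three-dimensional vectors and the `sentence_vector` helper are hypothetical, and skipping unknown words is my assumption rather than something the experiments above specify.

```python
# Hypothetical 3-dimensional word vectors; real embeddings are much wider.
word_vectors = {
    "good": [0.9, 0.1, 0.0],
    "movie": [0.2, 0.7, 0.1],
    "bad": [-0.8, 0.1, 0.1],
}


def sentence_vector(words, vectors):
    """Element-wise sum of the vectors of the known words in a sentence."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for word in words:
        vec = vectors.get(word)
        if vec is None:
            # Assumption: out-of-vocabulary words are simply skipped.
            continue
        total = [t + v for t, v in zip(total, vec)]
    return total


print(sentence_vector(["good", "movie"], word_vectors))
```

Note that a plain sum makes longer sentences produce larger vectors, which is one reason normalizing the result can matter.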
Although I spent less time on TensorFlow, I noticed that pre-processing was important there too. Using the same architecture, unnormalized data gave me about 50% accuracy while L2-normalized word vectors gave me 93%. Neither figure was still improving by the end of the 1000 epochs I let them run for.
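For reference, L2 normalization just divides a vector by its Euclidean length so that it ends up with length 1. A minimal sketch in plain Python follows. The `l2_normalize` name is mine, and leaving an all-zero vector unchanged is an assumption to avoid dividing by zero.

```python
import math


def l2_normalize(vec):
    """Scale vec so that its Euclidean (L2) norm is 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        # Assumption: an all-zero vector is returned unchanged.
        return list(vec)
    return [x / norm for x in vec]


unit = l2_normalize([3.0, 4.0])
print(unit)
print(math.sqrt(sum(x * x for x in unit)))
```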
Changing tack completely, I also one-hot encoded my words to form sentence vectors and achieved a respectable but not stellar accuracy of 84%.
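One way to build such vectors is sketched below. This is an assumption about the encoding rather than a record of exactly what I ran: each word gets an index, and a sentence vector has a 1 at the index of every word it contains (a binary bag of words). The helper names are hypothetical.

```python
def build_vocabulary(sentences):
    """Assign each distinct word an index in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence:
            vocab.setdefault(word, len(vocab))
    return vocab


def one_hot_sentence(words, vocab):
    """Binary vector with a 1 for every vocabulary word present in the sentence."""
    vec = [0] * len(vocab)
    for word in words:
        idx = vocab.get(word)
        if idx is not None:  # words outside the vocabulary are ignored
            vec[idx] = 1
    return vec


corpus = [["a", "good", "movie"], ["a", "bad", "movie"]]
vocab = build_vocabulary(corpus)
print(vocab)
print(one_hot_sentence(["bad", "movie"], vocab))
```

Summing the one-hot vectors instead of taking presence would give word counts, which is the closer analogue of adding word vectors together.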
I thought that perhaps pre-processing merely gave the neural net a head start and that it would learn just as well without it, given enough time. This appears not to be true. Taking the data with no processing of the word, sentence and column vectors, the accuracy with increasing number of epochs looked like this:
| Number of epochs | Accuracy (%) |
This demonstrates a clear plateau in accuracy at about 85% when we know we can achieve an accuracy as high as 95% using the same classifier and the same data (if rendered somewhat differently).