Monday, June 4, 2018

Manifold Destiny


Manifolds are "'spaces' of points, together with a notion of distance that looked like Euclidean distance on small scales but which could be quite different at larger scales." [1]

"Although there is a formal mathematical meaning to the term manifold, in machine learning it tends to be used more loosely to designate a connected set of points that can be approximated well by considering only a small number of degrees of freedom, or dimensions, embedded in a higher-dimensional space... In the context of machine learning, we allow the dimensionality of the manifold to vary from one point to another. This often happens when a manifold intersects itself. For example, a figure eight is a manifold that has a single dimension in most places but two dimensions at the intersection at the center.

"Many machine learning problems seem hopeless if we expect the machine learning algorithm to learn functions with interesting variations across all of ℝn. Manifold learning algorithms surmount this obstacle by assuming that most of ℝn consists of invalid inputs, and that interesting inputs occur only along a collection of manifolds containing a small subset of points, with interesting variations in the output of the learned function occurring only along directions that lie on the manifold, or with interesting variations happening only when we move from one manifold to another. Manifold learning was introduced in the case of continuous-valued data and the unsupervised learning setting, although this probability concentration idea can be generalized to both discrete data and the supervised learning setting: the key assumption remains that probability mass is highly concentrated."

The manifold hypothesis is the observation "that the probability distribution over images, text strings, and sounds that occur in real life is highly concentrated." [2]

Transforming the data

"The most logical way to transform hour is into two variables that swing back and forth out of sink. Imagine the position of the end of the hour hand of a 24-hour clock. The x position swings back and forth out of sink with the y position. For a 24-hour clock you can accomplish this with x=sin(2pi*hour/24),y=cos(2pi*hour/24).

"You need both variables or the proper movement through time is lost. This is due to the fact that the derivative of either sin or cos changes in time where as the (x,y) position varies smoothly as it travels around the unit circle.

"Finally, consider whether it is worthwhile to add a third feature to trace linear time, which can be constructed my hours (or minutes or seconds) from the start of the first record or a Unix time stamp or something similar. These three features then provide proxies for both the cyclic and linear progression of time e.g. you can pull out cyclic phenomenon like sleep cycles in people's movement and also linear growth like population vs. time" (StackExchange).

Normalizing the data

"If the input variables are combined linearly, as in an MLP, then it is rarely strictly necessary to standardize the inputs, at least in theory. The reason is that any rescaling of an input vector can be effectively undone by changing the corresponding weights and biases, leaving you with the exact same outputs as you had before. However, there are a variety of practical reasons why standardizing the inputs can make training faster and reduce the chances of getting stuck in local optima. Also, weight decay and Bayesian estimation can be done more conveniently with standardized inputs" ( FAQ).

Just how normalization and standardization can affect the overall accuracy for different optimizers can be seen on Rinat Maksutov's blog.

In general, normalization speeds up training rather than changing the final accuracy. "It is good idea not just to normalize data but also to scale them. This is intended for faster approaching to global minima at error surface" (StackOverflow).

Doing just the opposite, multiplying all elements in all vectors by a random factor between 0.5 and 1.5, made no difference to the accuracy (94.5%) of an ANN on the "20 Newsgroups" data set. By comparison, the choice of optimizer made a huge difference: gradient descent reached only 15.6%, irrespective of whether the vectors were normalized. (The default optimizer for Spark is L-BFGS.)
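
The perturbation itself is simple to sketch. The version below is an assumption about how the factor was applied (one random factor per vector, drawn uniformly), the function name randomly_rescale is hypothetical, and the code does not reproduce the Spark experiment or its accuracy figures.

```python
import random


def randomly_rescale(vectors, low=0.5, high=1.5, seed=0):
    """Multiply every element of each vector by one random factor per vector.

    Assumption: the factor is drawn uniformly from [low, high] and shared by
    all elements of a vector, so directions are preserved and only lengths change.
    """
    rng = random.Random(seed)
    rescaled = []
    for vec in vectors:
        factor = rng.uniform(low, high)
        rescaled.append([factor * v for v in vec])
    return rescaled
```

Because each vector keeps its direction, this is the inverse of length normalization: it deliberately makes the vector norms inconsistent while leaving the relative feature weights within each document untouched.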

[1] The Princeton Companion to Mathematics
[2] Deep Learning (Goodfellow, Bengio, Courville)
