Sunday, May 9, 2021

Higher Dimensional Distances

I was happily finding the Euclidean distance between vectors in a high-dimensional feature space when my boss told me that Euclidean distance becomes less informative as the number of dimensions grows - see the paper On the Surprising Behavior of Distance Metrics in High Dimensional Space.

Some more Googling shows this is a pretty well-documented phenomenon:

"Our intuitions, which come from a three-dimensional world, often do not apply in high-dimensional ones. In high dimensions, most of the mass of a multivariate Gaussian distribution is not near the mean, but in an increasingly distant “shell” around it; and most of the volume of a high-dimensional orange is in the skin, not the pulp. If a constant number of examples is distributed uniformly in a high-dimensional hypercube, beyond some dimensionality most examples are closer to a face of the hypercube than to their nearest neighbor. And if we approximate a hypersphere by inscribing it in a hypercube, in high dimensions almost all the volume of the hypercube is outside the hypersphere.  This is bad news for machine learning, where shapes of one type are often approximated by shapes of another." A Few Useful Things to Know About Machine Learning

We can show this with a short simulation. Given uniformly distributed points in a d-dimensional hypercube, the ratio of the difference between the largest and smallest pairwise distance to the largest pairwise distance, (Dmax − Dmin) / Dmax, behaves like this (a code sketch follows the plot):

distance ratio over number of dimensions of uniformly distributed points
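
The plot above comes from a simulation along these lines (a sketch; the point counts and dimensions here are assumptions, not the original settings):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

def relative_contrast(n_points, dim):
    """Return (D_max - D_min) / D_max over all pairwise Euclidean distances
    of points drawn uniformly from the unit hypercube [0, 1]^dim."""
    points = rng.uniform(size=(n_points, dim))
    dists = pdist(points)          # all pairwise Euclidean distances
    return (dists.max() - dists.min()) / dists.max()

for dim in (1, 2, 5, 10, 100, 1000):
    print(f"d={dim:4d}  contrast = {relative_contrast(100, dim):.3f}")
```

As the dimension grows the contrast shrinks toward zero: the nearest and farthest neighbours end up almost equally far away.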

Alternatively stated, the ratio of the minimum to the maximum pairwise distance among 100 uniformly distributed points approaches 1 as the number of dimensions approaches infinity:
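
A sketch that prints this min/max ratio directly (again with an assumed 100 points per trial):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

for dim in (2, 10, 100, 1000, 10000):
    points = rng.uniform(size=(100, dim))  # 100 uniform points in [0, 1]^dim
    dists = pdist(points)                  # pairwise Euclidean distances
    print(f"d={dim:5d}  min/max distance ratio = {dists.min() / dists.max():.3f}")
```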


Doomed?

But "signal-to-noise matters" [SO]. Note that the above examples used pure randomness in the data.

Instead of using random vectors, if we deliberately construct two clusters and look at the ratio of the average distance within a cluster to the average distance between the two clusters, the results look more like this (sketched in code after the plot):

Ratio of distances within vs distances between two clusters
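
A sketch of that experiment (the cluster construction below, two unit-variance Gaussian blobs whose means differ in every coordinate, is an assumption made to illustrate the point, not necessarily the original setup):

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

rng = np.random.default_rng(0)

def within_between_ratio(dim, n=100, separation=5.0):
    """Average within-cluster distance divided by average between-cluster
    distance, for two Gaussian clusters offset by `separation` in every
    coordinate."""
    a = rng.normal(loc=0.0, scale=1.0, size=(n, dim))
    b = rng.normal(loc=separation, scale=1.0, size=(n, dim))
    within = (pdist(a).mean() + pdist(b).mean()) / 2
    between = cdist(a, b).mean()
    return within / between

for dim in (2, 10, 100, 1000):
    print(f"d={dim:4d}  within/between = {within_between_ratio(dim):.3f}")
```

Because the separation shows up in every coordinate, the between-cluster distance grows at the same sqrt(d) rate as the within-cluster distance, and the ratio levels off well below 1 instead of collapsing toward it.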

where ratios quickly settle down. "Mapping your data set x,y→x,y,x,y,x,y,x,y,...,x,y increases representative dimensionality, but does not at all make Euclidean distance fail. (See also: intrinsic dimensionality) So in the end, it still depends on your data. If you have a lot of useless attributes, Euclidean distance will become useless. If you could easily embed your data in a low-dimensional data space, then Euclidean distance should also work in the full dimensional space." [ibid]
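
The duplication example from that quote is easy to verify: mapping each point x to (x, x, ..., x) with k copies multiplies every Euclidean distance by √k, so nearest-neighbour orderings are unchanged even though the nominal dimensionality explodes. A quick sanity check (a sketch, with arbitrary sizes):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

x = rng.normal(size=(50, 2))   # data with intrinsic dimensionality 2
x_repeated = np.tile(x, 100)   # the same data embedded in 200 dimensions

d_low = pdist(x)
d_high = pdist(x_repeated)

# Every pairwise distance is scaled by sqrt(100) = 10, so all relative
# comparisons (and hence nearest neighbours) are preserved.
print(np.allclose(d_high, 10 * d_low))   # True
```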

So, although my boss is technically correct, you may not have to worry about it in practice: my algorithm went on to produce acceptable results.