Tuesday, June 18, 2019

Signal from noise - part 1


PCA

How do you mathematically determine whether there is a signal in a scatter plot? A non-maths colleague said "just use principal component analysis", but it's much harder than that.

First, how would you even represent the data? Since it's a 2-D scatter plot, you might represent the data as an Nx2 matrix, where N is the number of points. You could then use singular value decomposition (SVD) to calculate the principal components, as SVD and PCA are closely related [StackOverflow].

Recall that if our matrix is called X,

(U, S, VT) = SVD(X)

Note that the "Principal components are given by 𝐗𝐕 = 𝐔𝐒𝐕ᵀ𝐕 = 𝐔𝐒." (ibid)
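To make the identity concrete, here is a minimal NumPy sketch. The random points are a hypothetical stand-in for the real scatter data, and centring the columns first is my assumption (it is the usual convention for PCA), not something taken from the repository.

```python
import numpy as np

# Hypothetical stand-in for the scatter plot: N random points in 2-D.
rng = np.random.default_rng(42)
N = 500
X = rng.uniform(-1.0, 1.0, size=(N, 2))

# PCA is normally applied to column-centred data, so centre first (assumption).
Xc = X - X.mean(axis=0)

# Thin SVD: U is N x 2, S holds the 2 singular values, VT is 2 x 2.
U, S, VT = np.linalg.svd(Xc, full_matrices=False)

# The principal components computed two ways: XV and US.
pcs_xv = Xc @ VT.T
pcs_us = U * S  # broadcasting scales column i of U by S[i]

# The full product U S V^T gives back the centred data.
X_rebuilt = (U * S) @ VT

print(np.allclose(pcs_xv, pcs_us))
print(np.allclose(X_rebuilt, Xc))
```

Multiplying `U * S` rather than building a diagonal matrix is just a shortcut; `U @ np.diag(S)` would give the same result.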

Now, we can reconstruct our original X by the matrix product 𝐔𝐒𝐕ᵀ. We can see a real example of this in my GitHub repository here. Full reconstruction looks like this:
As you can see, I've added a sinusoidal shape to an otherwise random collection of points.

Now, we have only 2 principal components (as it's an Nx2 matrix) so let's see what it's like when we reconstruct with just one PC:
Only the first Principal Component

Only the second Principal Component
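A single-component reconstruction is a rank-1 matrix, so every reconstructed point lies on one straight line through the origin, which is why neither plot can show a curve. The sketch below illustrates this under stated assumptions: `reconstruct_from` is a hypothetical helper, and the noise-plus-sine data is invented for the example rather than taken from the repository.

```python
import numpy as np

def reconstruct_from(X, keep):
    """Rebuild X using only the singular components whose indices are in `keep`."""
    U, S, VT = np.linalg.svd(X, full_matrices=False)
    S_kept = np.zeros_like(S)
    S_kept[keep] = S[keep]  # zero out every component not listed in `keep`
    return (U * S_kept) @ VT

# Hypothetical data: uniform noise plus points along a sine curve.
rng = np.random.default_rng(0)
noise = rng.uniform(-1.0, 1.0, size=(300, 2))
t = rng.uniform(-1.0, 1.0, size=100)
signal = np.column_stack([t, 0.5 * np.sin(np.pi * t)])
X = np.vstack([noise, signal])

first_only = reconstruct_from(X, [0])
second_only = reconstruct_from(X, [1])

# Each partial reconstruction has rank 1; together they add back up to X.
print(np.linalg.matrix_rank(first_only), np.linalg.matrix_rank(second_only))
print(np.allclose(first_only + second_only, X))
```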

Data Isomorphism

OK, so looking at just one principal component in this situation doesn't tell us much. So, let's rephrase the question. Let's treat the scatter plot as an IxJ matrix that holds either ones or zeros. This representation holds the same information as before (up to the resolution of the grid), but this time it will have up to min(I, J) principal components.
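One way to build such a matrix is to bin the points into a grid and mark each occupied cell with a one. This is only a sketch: `scatter_to_grid` is a hypothetical helper, and the 64x48 grid and random points are arbitrary choices for illustration.

```python
import numpy as np

def scatter_to_grid(points, n_rows, n_cols):
    """Bin 2-D points into an n_rows x n_cols matrix of ones (occupied) and zeros (empty)."""
    counts, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=[n_rows, n_cols])
    return (counts > 0).astype(float)

# Hypothetical scatter data.
rng = np.random.default_rng(1)
points = rng.uniform(0.0, 1.0, size=(400, 2))

grid = scatter_to_grid(points, 64, 48)

# An I x J matrix has min(I, J) singular values, hence at most that many components.
singular_values = np.linalg.svd(grid, compute_uv=False)
print(grid.shape, singular_values.shape)
```

Note that two points falling in the same cell collapse into a single one, so a finer grid preserves more of the original scatter.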

Here, I've changed the visualisation from a scatter plot to a heat map since the reconstructed matrix may not be just ones and zeros if we discard some PCs. Here is a reconstruction with only the top quartile of principal components:
Reconstruction discarding the bottom 75% of Principal Components.
The sinusoidal "signal" is still evident in the reconstruction but less so. Unfortunately, PCA has not helped us here to identify what we'd regard as the dominant feature of the data.
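For reference, a truncated reconstruction of this kind can be sketched as follows. The helper `keep_top_fraction`, the 64x64 grid, the 5% noise density and the sine-shaped band are all assumptions made for the example; they are not the data behind the image above.

```python
import numpy as np

def keep_top_fraction(M, fraction):
    """Rebuild M from only its largest singular components."""
    U, S, VT = np.linalg.svd(M, full_matrices=False)
    k = max(1, int(len(S) * fraction))  # number of components kept
    approx = (U[:, :k] * S[:k]) @ VT[:k, :]
    return approx, S, k

# Hypothetical 64 x 64 grid: sparse random ones plus a sine-shaped band.
rng = np.random.default_rng(2)
size = 64
grid = (rng.uniform(size=(size, size)) < 0.05).astype(float)
cols = np.arange(size)
rows = np.round((size - 1) / 2 * (1 + 0.8 * np.sin(2 * np.pi * cols / size))).astype(int)
grid[rows, cols] = 1.0

# Keep the top quartile of components, i.e. discard the bottom 75%.
approx, S, k = keep_top_fraction(grid, 0.25)

# The Frobenius error equals the energy in the discarded singular values.
error = np.linalg.norm(grid - approx)
print(k, error)
```

The values in `approx` are no longer restricted to zero and one, which is why a heat map (for example matplotlib's `imshow`) suits it better than a scatter plot.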

Conclusion

The principal components in a mathematical sense are not necessarily the principal features that a human brain may identify.

There may be better ways to identify patterns in data than PCA. In this particular case, I got much more mileage from Convolutional Neural Nets (CNNs). I'll document that in another post.
