Sunday, November 8, 2020

Statistical Covariance Part 2

I quoted in a previous post somebody who said that covariance matrices always have a determinant of 0. This isn't strictly true. What can I say? I'm not a very good mathematician.

The argument was that the mean is subtracted from each row and, with some simple algebra, you can apparently show that the determinant is 0. But this ignores the fact that covariance is the expected value of the product of two rows after each has had its own expected value subtracted.

Say we have two different distributions from which we draw combinations. However, for reasons peculiar to the use case, certain combinations are not allowed. When we tabulate the joint probabilities over this state space, some cells will be zero. So even though the corresponding entries in the two marginal probability vectors might both be non-zero, the probability of those two values occurring together is zero.
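
As a minimal sketch of the idea (the state space and probabilities here are entirely made up by me):

import numpy as np

# Made-up marginal distributions for two variables: A has 2 states, B has 3.
p_a = np.array([0.4, 0.6])
p_b = np.array([0.2, 0.3, 0.5])

# A joint probability table over the same state space in which the combination
# (A=0, B=2) is forbidden, so its cell is zero. Its rows and columns still sum
# to the marginals above.
joint = np.array([[0.15, 0.25, 0.00],
                  [0.05, 0.05, 0.50]])

# Independence would suggest p_a[0] * p_b[2] = 0.2 for the forbidden cell...
print(p_a[0] * p_b[2])   # 0.2
# ...but the actual joint probability of that combination is zero.
print(joint[0, 2])       # 0.0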

Note another caveat. Bearing in mind that correlation is just covariance divided by the root of the product of both variances, "the correlation between A and B is only a measure of the strength of the linear relationship between A and B. Two random variables can be perfectly related, to the point where one is a deterministic function of the other, but still have zero correlation if that function is non-linear." [Daniel Shiebler's blog]
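
A quick illustration of that caveat (a toy example of my own, not from the linked post): y is a deterministic function of x, yet the correlation is essentially zero because the relationship is non-linear.

import numpy as np

# x is symmetric about zero and y is completely determined by x.
x = np.linspace(-1.0, 1.0, 1001)
y = x ** 2

# Pearson correlation: covariance divided by the product of the standard deviations.
corr = np.corrcoef(x, y)[0, 1]
print(corr)   # ~0 up to floating point noise, despite y being a function of x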

Positive Definite

A matrix M is positive definite if xᵀ M x > 0 for every non-zero vector x (an equivalent definition is "a symmetric matrix whose eigenvalues are all positive real numbers" - Coding the Matrix, Klein, definition 12.6.1).

Note that any matrix that can be expressed as AᵀA is positive semidefinite (see this SO answer). The proof is simple: substitute AᵀA for M above.

xᵀ M x = xᵀ AᵀA x = (Ax)ᵀ(Ax) = ‖Ax‖²

and a vector dotted with itself is never negative. Since our covariance matrix can be expressed as AᵀA, it too is at least positive semidefinite (xᵀ M x ≥ 0). But if it is also invertible, it must be positive definite: xᵀ M x = 0 for some x ≠ 0 would mean Ax = 0 and hence M x = AᵀA x = 0, and you can't invert a matrix for which M x = 0 has a non-zero solution.
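
A quick sanity check in NumPy (the matrix A below is a random one of my own choosing): build M = AᵀA and confirm that its eigenvalues are non-negative and that xᵀ M x equals ‖Ax‖² for an arbitrary x.

import numpy as np

rng = np.random.default_rng(0)

A = rng.standard_normal((5, 3))   # any real matrix will do
M = A.T @ A                       # M = AᵀA, symmetric by construction

# All eigenvalues of M should be >= 0 (up to floating point noise).
print(np.linalg.eigvalsh(M))

# xᵀ M x = ‖Ax‖² >= 0 for any vector x.
x = rng.standard_normal(3)
print(x @ M @ x, np.linalg.norm(A @ x) ** 2)   # the two values agree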

Why you can't invert such a matrix is explained in this elegant SO answer. Basically, if M x = 0 for some x ≠ 0 then the rows of M must be linearly dependent for the equation

∑ⱼ Mᵢ,ⱼ xⱼ = 0   ∀ i

to hold. That is, if you give me (n-1) values in a row, I can tell you the value of the last one. Here's a simple Python/NumPy example where the last value in each row is just the sum of the first two. It's easy to see that multiplying this matrix by the vector [1, 1, -1] gives the zero vector, that is, the vector lies in M's null space:

>>> import numpy as np
>>> M = np.array([[1, 2, 3], [4, 5, 9], [6, 7, 13]])
>>> np.linalg.det(M)
0.0
>>> M @ np.array([1, 1, -1])   # [1, 1, -1] is in M's null space
array([0, 0, 0])

But if a matrix's rows are linearly dependent, its determinant must be 0. And if a matrix's determinant is 0, it cannot be inverted. QED.

What if my Matrix is Singular?

We add a tiny amount to the diagonal to make it non-singular. This is called conditioning. It sounds like a hack but it does have a basis in maths. In a frequentist interpretation, this is ridge regression. In a Bayesian interpretation, it's the prior.
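
Before the maths, a small sketch of the mechanical fix (the matrix below is a made-up singular AᵀA, and 1e-6 is an arbitrary choice for the amount added):

import numpy as np

# A data matrix whose second column is twice the first, so AᵀA is singular.
A = np.array([[1.0, 2.0, 0.5],
              [2.0, 4.0, 1.0],
              [3.0, 6.0, 2.0]])
C = A.T @ A

print(np.linalg.det(C))                  # ~0: C can't be reliably inverted

# Condition it: add a small multiple of the identity to the diagonal.
lam = 1e-6
C_conditioned = C + lam * np.eye(3)

print(np.linalg.det(C_conditioned))      # clearly non-zero now
C_inv = np.linalg.inv(C_conditioned)     # this succeeds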

Briefly, the argument goes that in ridge regression we penalize large model parameters. So, instead of minimizing our squared error ‖Xθ - y_actual‖² alone, we minimize the error plus a penalty:

L(θ) = ‖Xθ - y_actual‖² + λ θᵀθ

by differentiating with respect to θ and setting the result to zero. Solve that equation and a [XᵀX + λI]⁻¹ term appears.
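
Working through that differentiation gives the closed form θ = (XᵀX + λI)⁻¹ Xᵀ y_actual. Here's a sketch of it in NumPy (the data is synthetic and λ is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: y is a linear function of X plus a little Gaussian noise.
X = rng.standard_normal((100, 3))
true_theta = np.array([1.5, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.standard_normal(100)

lam = 0.1   # the ridge penalty

# θ = (XᵀX + λI)⁻¹ Xᵀ y  -- note the conditioning term λI on the diagonal.
# (np.linalg.solve avoids forming the inverse explicitly.)
theta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(theta)   # close to true_theta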

The argument for the Bayesian prior briefly goes like this: if we take a frequentist view and assume that the error in our data is Gaussian, then exponentiating the negative of the penalized loss above gives our familiar equation for a Gaussian (with Xθ as its mean) multiplied by exp(-λθᵀθ). Since the Bayesian posterior p(θ|Data, Model) is proportional to the likelihood times the prior, exp(-λθᵀθ) is the only factor that can map to the prior p(θ|Model), simply because it's the only extra term with θ in it. Therefore, our conditioning manifests itself in the prior.
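
For completeness, here is a compact version of that correspondence (my own paraphrase of the standard derivation, with σ² the noise variance and λ rescaled to absorb it):

$$p(\theta \mid \text{Data}) \propto \exp\left(-\tfrac{1}{2\sigma^2}\lVert X\theta - y\rVert^2\right)\,\exp\left(-\lambda\,\theta^\top\theta\right)$$

$$-\log p(\theta \mid \text{Data}) = \tfrac{1}{2\sigma^2}\lVert X\theta - y\rVert^2 + \lambda\,\theta^\top\theta + \text{const}$$

The first factor is the Gaussian likelihood, the second is the prior p(θ|Model), and minimizing the penalized least-squares loss is the same as maximizing this posterior.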

Full derivations appear in Jake VanderPlas' wonderful blog.
