Saturday, February 2, 2019

Statistical Covariance



The meaning of statistical covariance is completely different from that used in programming languages, but it's very important to data scientists. Here are some miscellaneous notes I made while playing around.

The covariance matrix

... can be calculated for matrix X with something like this:

    n = X.shape[0]                # number of observations (rows of X)
    x = X - np.mean(X, axis=0)    # centre each column
    C = np.dot(x.T, x) / (n - 1)  # sample covariance of the columns

For x, we're just taking the means of each column and subtracting each mean from each element in its respective column.
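
As a sanity check, here's a minimal NumPy sketch (the toy data and variable names are my own) comparing this manual calculation with np.cov:

    import numpy as np

    # Toy data: 5 observations (rows) of 3 variables (columns).
    X = np.random.rand(5, 3)
    n = X.shape[0]

    x = X - np.mean(X, axis=0)    # centre each column
    C = np.dot(x.T, x) / (n - 1)  # sample covariance of the columns

    # np.cov treats rows as variables by default, hence rowvar=False.
    assert np.allclose(C, np.cov(X, rowvar=False))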

Note that in the diagram above, the three vectors from the covariance matrix all live in the same plane. This is not a coincidence. The covariance matrix “is ALWAYS singular for ANY square matrix because you subtract off the column means. This guarantees that you reduce the rank by one (unless it is already singular) before multiplying the matrix with its transpose” (from MathWorks).
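
Here's a quick NumPy sketch (my own illustration, not from the MathWorks answer) showing that centring the columns of a square matrix drops its rank by one, which is why the resulting covariance matrix is singular:

    import numpy as np

    N = 10
    X = np.random.rand(N, N)           # a random square data matrix
    x = X - np.mean(X, axis=0)         # centre each column

    # Every column of x now sums to zero, so the rows of x are linearly
    # dependent and the rank drops from N to N - 1.
    print(np.linalg.matrix_rank(X))    # almost surely 10
    print(np.linalg.matrix_rank(x))    # 9

    # Consequently the covariance matrix is singular.
    C = np.cov(X, rowvar=False)
    print(np.linalg.matrix_rank(C))    # 9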

Take this R code:

NROW <- 10
NCOL <- 10
res <- rep(NA, 1000)
for (i in 1:1000) {
   x <- matrix(runif(NROW*NCOL), ncol = NCOL, nrow = NROW)
   C <- cov(x)
   res[i] <- det(C)
}
hist(res, breaks = 100, main="det(C)")



and look at the histogram:

The reason it's not exactly zero every time is simply floating-point rounding error. You can see from the spread that it should really be zero.

Anyway, there are many equivalent properties of an N×N matrix whose determinant is zero (here are some) but the one we are interested in is that "the columns of the matrix are linearly dependent vectors in ℝᴺ".

The proof in 2-D looks like this. Take the matrix:

⎡ a  c ⎤
⎣ b  d ⎦

The means of the columns are:

μ1 = (a+b)/2
μ2 = (c+d)/2

So, centering this matrix results in:

⎡ a-μ1  c-μ2 ⎤
⎣ b-μ1  d-μ2 ⎦

Substituting in the values for μ1 and μ2 gives:

⎡ (a-b)/2  (c-d)/2 ⎤
⎣ (b-a)/2  (d-c)/2 ⎦

Multiplying this matrix by its transpose gives:

(1/4) ⎡  (a-b)²+(c-d)²   -(a-b)²-(c-d)² ⎤
      ⎣ -(a-b)²-(c-d)²    (a-b)²+(c-d)² ⎦

and the determinant is therefore:

(1/4²) [ ((a-b)²+(c-d)²)² - ((a-b)²+(c-d)²)² ] = 0

The matrix is always singular, which necessarily means there is a linear dependency among its column vectors.

This generalizes to higher dimensions.
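
If you'd rather not push the algebra around by hand, here's a SymPy sketch (my own check of the working above) confirming that the determinant of the centred 2x2 matrix times its transpose is identically zero:

    import sympy as sp

    a, b, c, d = sp.symbols('a b c d')

    # The 2x2 data matrix from the proof above.
    X = sp.Matrix([[a, c],
                   [b, d]])

    # Subtract the column means to centre the matrix.
    mu1 = (a + b) / 2
    mu2 = (c + d) / 2
    x = X - sp.Matrix([[mu1, mu2],
                       [mu1, mu2]])

    # Multiply by the transpose and take the determinant.
    print(sp.simplify((x * x.T).det()))   # 0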

Note that we might choose the correlation matrix rather than the covariance matrix (see StackOverflow).

Note also that the “units [of covariance] are the product of the units of X and Y. So the covariance of weight and height might be in units of kilogram-meters, which doesn’t mean much.” [Think Stats p108].


Relationship with Cosine Similarity and Pearson Correlation

The triangle inequality is a defining property of norms and measures of distance. It simply says that for vectors x, y and z that make a triangle (so z = x + y), the length of z is no more than the length of x plus the length of y. It is a consequence of the law of cosines and a defining property of distances, that is, functions that satisfy:

non-negativity:       f(x, y) >= 0
identity:             f(x, y) = 0 if and only if x = y
symmetry:             f(x, y) = f(y, x)
triangle inequality:  f(x, z) <= f(x, y) + f(y, z)

Only symmetry holds for cosine similarity, so it is not a true distance, although it can be converted into one (see Wikipedia). Despite this, we can still use it for comparing data.
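
For example, a couple of toy calculations (my own) showing where the axioms fail:

    import numpy as np

    def cos_sim(u, v):
        # Cosine of the angle between u and v.
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    x = np.array([1.0, 2.0, 3.0])

    # Non-negativity fails: anti-parallel vectors have similarity -1.
    print(cos_sim(x, -x))       # -1.0

    # Identity fails: x and 2x are different vectors, yet their similarity
    # is 1 (so a "distance" of 1 - similarity would be 0).
    print(cos_sim(x, 2 * x))    # 1.0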

"Cosine similarity is not invariant to shifts. If x was shifted to x+1, the cosine similarity would change. What is invariant, though, is the Pearson correlation" [1] which is very similar. The relationship between the two looks like this:

Pearson correlation(x, y) = cosine similarity(x - x̄, y - ȳ)

"People usually talk about cosine similarity in terms of vector angles, but it can be loosely thought of as a correlation, if you think of the vectors as paired samples." [1]

Pearson correlation is the normalized form of the covariance. Or, writing the covariance as an inner product:

covariance = <x - x̄, y - ȳ> / n

where dividing by n (the number of samples) is just taking a mean, which is the same as:

cov(X, Y) = E[(X - E[X])(Y - E[Y])]
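
As a quick numerical check (again my own sketch), the inner-product form divided by n agrees with NumPy's population covariance:

    import numpy as np

    x = np.array([1.0, 3.0, 2.0, 5.0])
    y = np.array([2.0, 4.0, 1.0, 6.0])
    n = len(x)

    # <x - x̄, y - ȳ> / n
    inner = np.dot(x - x.mean(), y - y.mean()) / n

    # E[(X - E[X])(Y - E[Y])]; bias=True makes np.cov divide by n too.
    print(inner, np.cov(x, y, bias=True)[0, 1])   # identical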

And so all these concepts are related.

[1] Brendan O'Connor's blog
