Regression Analysis
"A key goal of regression analysis is to isolate the relationship between each independent variable and the dependent variable. The interpretation of a regression coefficient is that it represents the mean change in the dependent variable for each unit change in an independent variable when you hold all of the other independent variables constant." [Multicollinearity in Regression Analysis]
Similarly, one-hot encoding by definition increases multicollinearity: if feature X has value 1, then we know all the other encoded columns are 0, "which can be problematic when your sample size is small" [1].
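A sketch of why (the column names are made up): a full set of one-hot columns always sums to 1, which exactly duplicates the model's intercept column; pandas' drop_first option removes one level to break that exact dependence:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green", "red"]})

# Full one-hot encoding: every row sums to 1.0, which is perfectly
# collinear with an intercept column of ones (the "dummy variable trap").
full = pd.get_dummies(df["color"]).astype(float)
print(full.sum(axis=1))

# Dropping one level breaks the exact linear dependence.
reduced = pd.get_dummies(df["color"], drop_first=True).astype(float)
print(reduced.columns.tolist())  # e.g. ['green', 'red']
```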
The linked article describes how Variance Inflation Factors ("VIFs") can be used to identify multicollinearity.
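A minimal sketch of that check with statsmodels' variance_inflation_factor, on synthetic data where two predictors are nearly copies of each other (the cutoff of 5 or 10 is a common rule of thumb, not from the article):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)  # strongly correlated with x1
x3 = rng.normal(size=n)             # independent of the others

X = sm.add_constant(np.column_stack([x1, x2, x3]))
# The constant's VIF is not meaningful; the slopes' VIFs are the diagnostic.
for i, name in enumerate(["const", "x1", "x2", "x3"]):
    print(name, variance_inflation_factor(X, i))
# x1 and x2 show very large VIFs; x3 stays near 1
```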
"If you just want to make predictions, the model with severe multicollinearity is just as good!" [MiRA]
To remove or not?
The case for adding columns: Frost [1] has a good example of how excluding correlated features can give the wrong result. He describes a study testing the health impact of coffee that at first showed coffee was bad for you. However, the study ignored smoking. Since smokers are statistically more likely to be coffee drinkers, you can't exclude smoking from the analysis.
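A toy simulation of that confounding story (all effect sizes are made up): smoking hurts health and also raises coffee consumption, so a regression that leaves smoking out makes harmless coffee look harmful:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 5000
smoker = rng.binomial(1, 0.3, size=n).astype(float)
# Smokers drink more coffee on average
coffee = rng.normal(loc=1.0 + 2.0 * smoker, scale=1.0, size=n)
# Coffee itself is harmless here; smoking is not
health = 50.0 - 5.0 * smoker + rng.normal(scale=2.0, size=n)

# Omitting the confounder: coffee picks up smoking's effect (negative slope)
print(sm.OLS(health, sm.add_constant(coffee)).fit().params)

# Including it: coffee's coefficient goes to ~0, smoking's to ~-5
X = sm.add_constant(np.column_stack([coffee, smoker]))
print(sm.OLS(health, X).fit().params)
```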
The case for removing columns: "P-values less than the significance level indicate that the term is statistically significant. When a variable is not significant, consider removing it from the model." [1]
Inferential Statistics
But what if we don't just want to make predictions? "Regression analysis is a form of inferential statistics [in which] p-values and coefficients are the key regression output." [1]
Note that the p-value indicates whether we should reject the null hypothesis. It is not an estimate of how accurate our coefficient is. Even if the coefficient is large, if the p-value is also large, "the observed difference ... might represent random error. If we were to collect another random sample and perform the analysis again, this [coefficient] might vanish." [1]
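A quick sketch of that "might vanish" point: with no true effect and a small sample, refitting on fresh samples makes the estimated coefficient swing around zero, sometimes looking sizable while the p-value usually stays unconvincing:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 30  # a small sample exaggerates the swings
for trial in range(5):
    x = rng.normal(size=n)
    y = rng.normal(size=n)  # y is unrelated to x: the true coefficient is 0
    res = sm.OLS(y, sm.add_constant(x)).fit()
    print(f"trial {trial}: coef={res.params[1]:+.2f}  p={res.pvalues[1]:.2f}")
```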
Which brings us back to multicollinearity, as it "reduces the precision of the estimated coefficients, which weakens the statistical power of your regression model... [it] affects the coefficients and p-values." [MiRA]
["Statistical power in a hypothesis test is the probability that the test can detect an effect that truly exists." - Jim Frost]
[1] Regression Analysis, Jim Frost
[MiRA] Multicollinearity in Regression Analysis, Jim Frost