Friday, December 3, 2021

More Logistic Regression

Log-Likelihood
How good is our model? I'm told the best way to think about the likelihood in a logistic regression is as the probability of the data given the parameters. Compare this to what fitting a logistic regression actually does: find the most probable parameters given the data.
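Concretely, for 0/1 outcomes the log-likelihood is the sum over rows of the log of the probability the model gave to the observed class. A minimal NumPy sketch (the function name and arguments are mine, purely for illustration):

    import numpy as np

    def log_likelihood(X, y, beta):
        # X: (n, k) design matrix (include a column of ones for the intercept)
        # y: (n,) vector of 0/1 outcomes
        # beta: (k,) vector of fitted coefficients
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))   # P(y = 1 | X, beta)
        return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))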

First, some terminology:

Exo and Endo
"If an independent variable is correlated with the error term, we can use the independent variable to predict the error term, which violates the notion that the error term represents unpredictable random error... This assumption is referred to as exogeneity. Conversely, when this type of correlation exists, which violates the assumption, there is endogeneity." - Regression Analysis, Jim Frost.

"An endogenous variable is a variable in a statistical model that's changed or determined by its relationship with other variables within the model. In other words, an endogenous variable is synonymous with a dependent variable, meaning it correlates with other factors within the system being studied. Therefore, its values may be determined by other variables. Endogenous variables are the opposite of exogenous variables, which are independent variables or outside forces. Exogenous variables can have an impact on endogenous factors, however." [Investopedia]

Spark
Unfortunately, Spark does not seem to offer a log-likelihood calculation out of the box, so I was forced to code my own. I looked at the code in the Python library StatsModels and converted it to PySpark.
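The PySpark version amounts to the same sum. A sketch, assuming the predictions have already been scored into a DataFrame pred with a 0/1 label column and a prediction column holding P(y = 1) (both names are placeholders):

    from pyspark.sql import functions as F

    # Probability the model assigned to the label actually observed in each row.
    p_obs = F.when(F.col("label") == 1, F.col("prediction")).otherwise(1 - F.col("prediction"))

    # Log-likelihood = sum over rows of log P(observed label).
    log_lik = pred.select(F.log(p_obs).alias("ll")).agg(F.sum("ll")).first()[0]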

How closely the output from Spark agreed with StatsModels was very data dependent. I guess this is inevitable since Spark uses IRLS [Wikipedia] as the solver while StatsModels was using L-BFGS. I was forced to use L-BFGS in StatsModels as I was otherwise getting "singular matrix" errors [SO].
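For reference, the StatsModels call looked roughly like this (X and y are placeholders):

    import statsmodels.api as sm

    logit = sm.Logit(y, sm.add_constant(X))   # the Logit class
    res = logit.fit(method="lbfgs")           # the default Newton solver gave "singular matrix"
    print(res.llf)                            # log-likelihood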

But then, a passing data scientist helpfully pointed out that I was looking at the wrong StatsModels class (Logit in discrete_model.py rather than GLM in generalized_linear_model.py). Taking the GLM code, I got within 6% of StatsModels when both were run on the same data. Could I do better? Well, first I rolled back the use of L-BFGS to the default solver (which is IRLS for both libraries) and removed a regularization parameter in the Spark code that had crept in from a copy-and-paste - oops. Now, the difference between the two was an impressive 4.5075472233299e-06. Banzai!
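The apples-to-apples setup looks something like this: GLM with a binomial family on the StatsModels side (IRLS by default) against Spark's GeneralizedLinearRegression (also IRLS), with no regularization. The Spark-side log-likelihood then comes from the hand-rolled sum above applied to the model's transform output. X, y and train_df are placeholders:

    import statsmodels.api as sm
    from pyspark.ml.regression import GeneralizedLinearRegression

    # StatsModels: GLM with a Binomial family, fitted with IRLS by default.
    res = sm.GLM(y, sm.add_constant(X), family=sm.families.Binomial()).fit()
    print(res.llf)

    # Spark: also IRLS under the hood; regParam left at 0.0 so no regularization sneaks in.
    glr = GeneralizedLinearRegression(family="binomial", link="logit", regParam=0.0)
    spark_model = glr.fit(train_df)   # train_df assumed to have `features` and `label` columns
    pred = spark_model.transform(train_df)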

We may choose to have no regularization if we're looking for inferential statistics (which I am). This might lead to overfitting but that's OK. "Overfitting is predominantly an issue when building predictive models in which the goal is application to data not used to build the model itself... As far as other applications that are not predictive, overfitting is more secondary" [SO]

The Data
The size of the data makes a difference. The more data, the lower the likelihood that it is explained by any given set of parameters. This is simply because the likelihood is a product of per-row probabilities, each less than one: the state space is bigger, and that means there are more potentially wrong states.
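A toy illustration of this (reusing the log-likelihood formula above, with made-up coefficients): each extra row contributes a log-probability of at most zero, so the total only gets more negative as n grows.

    import numpy as np

    rng = np.random.default_rng(42)
    beta = np.array([0.5, -1.0])   # arbitrary "true" coefficients
    for n in (100, 1_000, 10_000):
        X = np.column_stack([np.ones(n), rng.normal(size=n)])
        p = 1 / (1 + np.exp(-(X @ beta)))
        y = rng.binomial(1, p)
        ll = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
        print(n, ll)               # the log-likelihood falls as n grows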

Also, biasing a single feature can change the log-likelihood significantly. Making a previously unimportant feature biased towards a certain outcome 2/3 of the time reduced the LL in my data by an order of magnitude. Not surprisingly, if the data is constrained to a manifold, then it's more likely your model will find it.
