Thursday, October 28, 2021

Logistic Regression Notes

Just a few miscellaneous notes on the logistic regression models I'm building with Spark.

Interpreting Coefficients

The coefficients for one-hot encoded features can simply be compared with one another to gauge the relative strength of their effects. "Things are marginally more complicated for the numeric predictor variables. A coefficient for a predictor variable shows the effect of a one unit change in the predictor variable." [DisplayR]

"How does one interpret a coefficient of 0.081 (Std. Error = 0.026) for tobacco, for example? Tobacco is measured in total lifetime usage in kilograms, with a median of 1.0kg for the controls and 4.1kg for the cases. Thus an increase of 1kg in lifetime tobacco usage accounts for an increase in the odds of coronary heart disease of exp(0.081) = 1.084 or 8.4%. Incorporating the standard error we get an approximate 95% confidence interval of exp(0.081 ± 2 × 0.026) = (1.03, 1.14)." [1]

Computational Problems

How do we avoid problems with singular matrices? "it's computationally cheaper (faster) to find the solution using the gradient descent in some cases... it works even when the design matrix has collinearity issues." [StackExchange]

"Switching to an optimizer that does not use the Hessian often succeeds in those cases. For example, scipy's 'bfgs' is a good optimizer that works in many cases"[StackOverflow]

Generalized Linear Models

"Where the response variable has an exponential family distribution [Gaussian, Bernoulli, gamma etc], whose mean is a linear function of the inputs ... this is known as a generalized linear model, and generalizes the idea of logistic regression to other kinds of response variables." [Machine Learning-A Probabilistic Perspective - Murphy]

"Software implementations can take advantage of these connections. For example, the generalized linear modeling software in R (which includes logistic regression as part of the binomial family of models) exploits them fully. GLM (generalized linear model) objects can be treated as linear model objects, and all the tools available for linear models can be applied automatically." [1]

Z-Scores

Z-Scores are "coefficients divided by their standard errors. A nonsignificant Z score suggests a coefficient can be dropped from the model.  Each of these correspond formally to a test of the null hypothesis that the coefficient in question is zero, while all the others are not (also known as the Wald test). A Z score greater than approximately 2 in absolute value is significant at the 5% level." [1]

Although this sounds like the job of a p-value, note that p-values are different from (but related to) z-scores. A "p-value indicates how unlikely the statistic is. z-score indicates how far away from the mean it is. There may be a difference between them, depending on the sample size." [StackOverflow]
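For a two-sided Wald test, the conversion between them is just the normal CDF. For example:

    from scipy.stats import norm

    z = 2.0
    p = 2 * (1 - norm.cdf(abs(z)))  # two-sided p-value for a standard normal z-score
    print(p)                        # ~0.0455: just significant at the 5% level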

You can see quite clearly that varying the sample size can profoundly change these metrics. It's very useful to be able to quickly create fake but realistic data and watch this effect in action. Given a large amount of data, you might want to lower the p-value cut-off, since "in large samples is more appropriate to choose a size of 1% or less rather than the 'traditional' 5%." [StackOverflow]
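Here's the kind of throwaway experiment I mean, using statsmodels (the data-generating process, the weak true effect of 0.05 and the sample sizes are all invented for illustration). The same effect that looks nonsignificant at n=100 shows up as wildly "significant" once n hits a million, since the z-score grows roughly with the square root of the sample size:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(42)

    def z_and_p(n, beta=0.05):
        """Fit a logistic regression on n fake rows with a weak true effect."""
        x = rng.normal(size=n)
        y = (rng.random(n) < 1 / (1 + np.exp(-beta * x))).astype(int)
        fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
        return fit.tvalues[1], fit.pvalues[1]

    for n in (100, 10_000, 1_000_000):
        z, p = z_and_p(n)
        print(f"n={n:>9}: z={z:6.2f}, p={p:.4f}")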

[1] Hastie, Tibshirani & Friedman, The Elements of Statistical Learning

Saturday, October 23, 2021

Miscellaneous Spark

Feature names in Spark ML

Why isn't this better known? This is a great way to extract the feature names from a DataFrame after the features have been assembled into a Vector (even one-hot encoded ones):

    # VectorAssembler and friends record per-slot metadata on the output column.
    # Each attribute group (e.g. "numeric", "binary") holds {idx, name} entries.
    attrs = df.schema[OUTPUT_COL].metadata["ml_attr"]["attrs"]
    index_to_name = {}
    for attr_group in attrs.values():
        for m in attr_group:
            index_to_name[m["idx"]] = m["name"]

where OUTPUT_COL is the column containing a Vector of encoded features.
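The payoff is being able to label coefficients. For example, assuming a fitted LogisticRegressionModel called model trained on OUTPUT_COL:

    # Pair each coefficient with the feature name recovered above.
    for i, coef in enumerate(model.coefficients.toArray()):
        print(index_to_name.get(i, f"feature {i}"), coef)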

Non-deterministic Lists

I've seen this Spark-related bug twice in our code base: even though lists have an order, collect_list "is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle." [StackOverflow]

The solution in the StackOverflow answer is to run collect_list over a Window that performs an orderBy.
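Something like this (the id, ts and event column names are made up; the shape is what matters). Note the explicit frame: with an orderBy but the default frame, each row would only collect the list up to itself.

    from pyspark.sql import Window
    import pyspark.sql.functions as F

    # Collect within each id in timestamp order, over the whole partition.
    w = (Window.partitionBy("id")
               .orderBy("ts")
               .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

    df = df.withColumn("events", F.collect_list("event").over(w))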