Monday, January 31, 2022

Spark Log Likelihood

Log-likelihood

I wanted to see how the number of iterations in a Generalized Linear Model affected the performance of my model. One way to do this is with log-likelihood. "Probability is used before data are available to describe plausibility of a future outcome, given a value for the parameter. Likelihood is used after data are available to describe plausibility of a parameter value." [Wikipedia]. In Bayesian inference, if P(C|D) is the probability, P(D|C) is the likelihood. We take logs to make the maths easier: a product of many small probabilities becomes a sum, which is numerically much better behaved.
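To make that concrete, here is a toy (and entirely made-up) Bernoulli example in plain Python. With a handful of points the product is fine; with thousands it underflows, while the sum of logs does not:

import math

# Hypothetical data: five Bernoulli observations, parameter p = 0.3.
data = [1, 0, 0, 1, 0]
p = 0.3

# Likelihood: a product of per-observation probabilities.
likelihood = math.prod(p if y == 1 else 1 - p for y in data)

# Log-likelihood: a sum of logs of the same probabilities.
log_likelihood = sum(math.log(p if y == 1 else 1 - p) for y in data)

assert abs(math.log(likelihood) - log_likelihood) < 1e-12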

Spark

My model was built using Spark's ML library. With it, I was trying to model the number of hours spent waiting in A&E. If the patient waits more than 4 hours, they have breached our target. The code looks like this:

from pyspark.ml.regression import GeneralizedLinearRegression

glr = GeneralizedLinearRegression(family="binomial",
                                  link="logit",
                                  featuresCol=OUTPUT_COL,
                                  labelCol=self.dependent_column,
                                  predictionCol=PREDICTION_COLUMN)

Since what I am really modeling is whether a patient breaches the four-hour target (a binary outcome, rather than the raw number of hours), a binomial is an appropriate distribution.
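For reference, a minimal, self-contained sketch of this setup might look like the following. The toy rows, the feature names and the breached label are invented stand-ins for my real data (and for OUTPUT_COL and friends above):

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import GeneralizedLinearRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical rows: (age, arrival_hour, breached), where breached = 1.0
# means the patient waited more than four hours.
train_df = spark.createDataFrame(
    [(34.0, 9.0, 0.0), (71.0, 23.0, 1.0), (55.0, 17.0, 1.0),
     (28.0, 11.0, 0.0), (60.0, 20.0, 0.0), (30.0, 22.0, 1.0)],
    ["age", "arrival_hour", "breached"],
)

# Pack the raw columns into the single vector column Spark ML expects.
assembler = VectorAssembler(inputCols=["age", "arrival_hour"],
                            outputCol="features")
train_df = assembler.transform(train_df)

glr = GeneralizedLinearRegression(family="binomial",
                                  link="logit",
                                  featuresCol="features",
                                  labelCol="breached",
                                  predictionCol="prediction")
model = glr.fit(train_df)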

Unfortunately, Spark does not appear to have a log-likelihood function out-of-the-box for GLMs. So, I tried to write my own.

In the Python library StatsModels, the (abbreviated) code for the Logit log-likelihood is:

import numpy as np

def loglike(self, params):
    q = 2*self.endog - 1          # map y in {0, 1} to {-1, +1}
    X = self.exog
    return np.sum(np.log(self.cdf(q*np.dot(X,params))))

def cdf(self, X):
    X = np.asarray(X)
    return 1/(1+np.exp(-X))       # the logistic function

from discrete_model.py.
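If you just want the number, StatsModels exposes it as the llf attribute of the fitted results, which is handy for sanity-checking any hand-rolled version. A small sketch with made-up data:

import numpy as np
import statsmodels.api as sm

# Hypothetical data: one feature plus a constant, and a binary outcome.
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=100))
y = (rng.random(100) < 0.5).astype(float)

results = sm.Logit(y, X).fit(disp=0)

# loglike(params) above is what .llf reports at the fitted parameters.
print(results.llf)
print(results.model.loglike(results.params))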

I turned this into PySpark code:

import math

from pyspark.sql import functions as F
from scipy import special

    def log_likelihood(self) -> float:
        '''
        PySpark version of the StatsModels log-likelihood code in
        discrete_model.Logit.loglike (above).
        :return: the log likelihood of the data on which this object was fitted.
        '''

        log_likelihood = "log_likelihood"
        inverse = "inverse"
        lin_pred = "lin_pred"
        df = self.df
        params = self.model.coefficients
        intercept = float(self.model.intercept)

        @F.udf("double")
        def dot_params(exog) -> float:
            # The linear predictor for one row: features . coefficients + intercept.
            vecs = exog.toArray()
            assert len(vecs) == len(params)
            x_dot_params = sum(map(lambda x: x[0] * x[1], zip(vecs, params)))
            return float(x_dot_params) + intercept

        @F.udf("double")
        def loglike_obs(endog, mu) -> float:
            # Binomial log-likelihood of one observation with n = 1 (Bernoulli).
            # The 1e-20 terms guard against taking log(0).
            n = 1.
            y = endog * n
            x = (special.gammaln(n + 1) - special.gammaln(y + 1) -
                    special.gammaln(n - y + 1) +
                    y * math.log(mu / (1 - mu + 1e-20)) +
                    n * math.log(1 - mu + 1e-20))
            return float(x)

        df = df.withColumn(lin_pred, dot_params(F.col(OUTPUT_COL)))  # + offset_exposure = 0. by default
        df = df.withColumn(inverse, 1. / (1 + F.exp(-1 * F.col(lin_pred))))  # inverse logit: the fitted mean, mu
        df = df.withColumn(log_likelihood, loglike_obs(F.col(self.dependent_column), F.col(inverse)))
        df = df.agg(F.sum(log_likelihood).alias(log_likelihood))
        rows = df.select(log_likelihood).collect()
        return rows[0][0]

Messy, but it works.
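As an aside, in the Bernoulli case (n = 1) all the gammaln terms are zero, so the per-row log-likelihood could instead be built from Spark's own column functions, avoiding the Python UDFs entirely. A sketch, assuming a DataFrame like the df above with hypothetical "mu" (the inverse column) and "label" (the dependent) columns:

from pyspark.sql import functions as F

# With n = 1 the binomial log-likelihood reduces to
# y*log(mu/(1-mu)) + log(1-mu); eps guards against log(0).
eps = 1e-20
ll_col = (F.col("label") * F.log(F.col("mu") / (1 - F.col("mu") + eps))
          + F.log(1 - F.col("mu") + eps))

log_likelihood = df.agg(F.sum(ll_col).alias("ll")).first()["ll"]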

AIC

As it happens, I didn't need to bother. Log-likelihood can be derived from a method that is already in the Spark API.

Once you call GeneralizedLinearRegression.fit, you'll have a GeneralizedLinearRegressionModel. An attribute of this is the summary (of type GeneralizedLinearRegressionTrainingSummary). From this you can get the AIC:

>>> model.summary.aic
116384.05296519656

The AIC is the Akaike Information Criterion, a metric grounded in Kullback-Leibler divergence. It is a function of the log-likelihood and the number of parameters in the model (including the intercept): AIC = 2k - 2*log_likelihood, where k is the number of parameters.

Note that the AIC is calculated based on the choice of family when we create our GeneralizedLinearRegression object.

To get the log-likelihood, some code like this:

        aic = model.summary.aic
        # degreesOfFreedom is the residual degrees of freedom (n - k), so
        # subtracting it from the row count recovers k, the parameter count.
        k = data_frame.count() - model.summary.degreesOfFreedom
        # rearranging AIC = 2k - 2*log_likelihood:
        log_likelihood = ((2 * k) - aic) / 2

should suffice.

I'd previously used log-likelihood as a metric with which to compare different "number of iterations" values. There, the data was the same and so was the model, so the log-likelihoods were directly comparable. AIC is useful if we're comparing two (or more) different models, since it also accounts for the number of parameters.
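Putting it together, the iteration comparison I originally wanted looks something like this sketch (reusing the hypothetical train_df and column names from the earlier snippet):

from pyspark.ml.regression import GeneralizedLinearRegression

n = train_df.count()
for max_iter in (1, 5, 10, 25):
    model = GeneralizedLinearRegression(
        family="binomial", link="logit",
        featuresCol="features", labelCol="breached",
        maxIter=max_iter,
    ).fit(train_df)
    # Same data, same family: the log-likelihoods are directly comparable.
    k = n - model.summary.degreesOfFreedom
    log_likelihood = ((2 * k) - model.summary.aic) / 2
    print(max_iter, log_likelihood)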
