Log-likelihood
I wanted to see how the number of iterations in a Generalized Linear Model affected the performance of my model. One way to measure this is with log-likelihood. "Probability is used before data are available to describe plausibility of a future outcome, given a value for the parameter. Likelihood is used after data are available to describe plausibility of a parameter value." [Wikipedia]. In Bayesian terms, if P(C|D) is the (posterior) probability, then P(D|C) is the likelihood. We take logs because sums of log-probabilities are easier to work with, and more numerically stable, than products of many small probabilities.
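To make that concrete, here is a minimal sketch (plain NumPy, with made-up coin flips, nothing to do with my model yet) showing how the log-likelihood scores two candidate parameter values against the same observed data:

import numpy as np

# Hypothetical data: ten observed coin flips (1 = heads).
data = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

def log_likelihood(p, data):
    # Log of the product of per-flip Bernoulli probabilities; taking logs
    # turns the product into a sum, which is easier to work with.
    return np.sum(data * np.log(p) + (1 - data) * np.log(1 - p))

print(log_likelihood(0.5, data))  # approx -6.93
print(log_likelihood(0.7, data))  # approx -6.11: 0.7 is the more plausible parameter given this data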
Spark
My model was built using Spark's ML library. With it, I was trying to model the number of hours spent waiting in A&E. If the patient waits more than 4 hours, they have breached our target. The code looks like this:
from pyspark.ml.regression import GeneralizedLinearRegression
glr = GeneralizedLinearRegression(family="binomial",
                                  link="logit",
                                  featuresCol=OUTPUT_COL,
                                  labelCol=self.dependent_column,
                                  predictionCol=PREDICTION_COLUMN)
Since the outcome I am really modelling is binary (did the patient breach the 4-hour target or not?), the binomial family with a logit link is an appropriate choice.
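For context, a fit looks something like the following. This is only a sketch: training_df and the breached label column are placeholders, OUTPUT_COL and PREDICTION_COLUMN are constants from my pipeline, and maxIter is the knob I was varying:

from pyspark.ml.regression import GeneralizedLinearRegression

for max_iter in (5, 10, 25):
    glr = GeneralizedLinearRegression(family="binomial",
                                      link="logit",
                                      maxIter=max_iter,
                                      featuresCol=OUTPUT_COL,
                                      labelCol="breached",
                                      predictionCol=PREDICTION_COLUMN)
    model = glr.fit(training_df)  # training_df: a DataFrame with the assembled feature vectors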
Unfortunately, Spark does not appear to have a log-likelihood function out-of-the-box for GLMs. So, I tried to write my own.
In the Python library statsmodels, the (abbreviated) code for the Logit log-likelihood is:
def loglike(self, params):
    # Map the 0/1 labels to -1/+1 so one expression covers both classes.
    q = 2*self.endog - 1
    X = self.exog
    return np.sum(np.log(self.cdf(q*np.dot(X, params))))

def cdf(self, X):
    # The logistic function, i.e. the CDF used by the logit link.
    X = np.asarray(X)
    return 1/(1 + np.exp(-X))
from discrete_model.py.
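As an aside, when the data fits in memory statsmodels will do all of this for you: the fitted result exposes the log-likelihood directly. A toy example with made-up data:

import numpy as np
import statsmodels.api as sm

# Made-up data: 100 rows, 3 features, a binary outcome.
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 3)))  # add an intercept column
y = rng.integers(0, 2, size=100)

result = sm.Logit(y, X).fit(disp=False)
print(result.llf)  # the maximised log-likelihood, computed via loglike() above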
I turned that statsmodels code into PySpark:
import math

from pyspark.sql import functions as F
from scipy import special


def log_likelihood(self) -> float:
    '''
    PySpark version of the Logit log-likelihood code in statsmodels' discrete_model.py (above).
    :return: the log-likelihood of the data on which this object was fitted.
    '''
    log_likelihood = "log_likelihood"
    inverse = "inverse"
    lin_pred = "lin_pred"
    df = self.df
    params = self.model.coefficients
    intercept = float(self.model.intercept)

    @F.udf("double")
    def dot_params(exog) -> float:
        # The linear predictor: x . beta + intercept.
        vecs = exog.toArray()
        assert len(vecs) == len(params)
        x_dot_params = sum(map(lambda x: x[0] * x[1], zip(vecs, params)))
        return float(x_dot_params) + intercept

    @F.udf("double")
    def loglike_obs(endog, mu) -> float:
        # Per-row binomial log-likelihood with n = 1 (i.e. Bernoulli).
        # The gammaln terms are the log binomial coefficient; the 1e-20
        # fudge guards against log(0) when mu is exactly 1.
        n = 1.
        y = endog * n
        x = (special.gammaln(n + 1) - special.gammaln(y + 1) -
             special.gammaln(n - y + 1) +
             y * math.log(mu / (1 - mu + 1e-20)) +
             n * math.log(1 - mu + 1e-20))
        return float(x)

    df = df.withColumn(lin_pred, dot_params(F.col(OUTPUT_COL)))  # + offset_exposure = 0. by default
    df = df.withColumn(inverse, 1. / (1 + F.exp(-1 * F.col(lin_pred))))  # the logistic CDF
    df = df.withColumn(log_likelihood, loglike_obs(F.col(self.dependent_column), F.col(inverse)))
    df = df.agg(F.sum(log_likelihood).alias(log_likelihood))
    rows = df.select(log_likelihood).collect()
    return rows[0][0]
Messy but it works.
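In use it is just another method on the wrapper object that holds self.df, self.model and self.dependent_column, so comparing iteration counts looks roughly like this (GlmWrapper is a hypothetical name for that class):

for max_iter in (5, 10, 25):
    wrapper = GlmWrapper(training_df, dependent_column="breached", max_iter=max_iter)
    wrapper.fit()
    print(max_iter, wrapper.log_likelihood())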
AIC
As it happens, I didn't need to bother. Log-likelihood can be derived from a method that is already in the Spark API.
Once you call GeneralizedLinearRegression.fit, you'll have a GeneralizedLinearRegressionModel. An attribute of this is the summary (of type GeneralizedLinearRegressionTrainingSummary). From this you can get the AIC:
>>> model.summary.aic
116384.05296519656
The AIC is the Akaike Information Criterion, a model-selection metric rooted in Kullback-Leibler divergence. It is defined as AIC = 2k - 2*ln(L), where k is the number of parameters in the model (including the intercept) and L is the maximised likelihood.
Note that the AIC is calculated based on the choice of family when we create our GeneralizedLinearRegression object.
To get the log-likelihood, some code like this:
aic = model.summary.aic
# degreesOfFreedom is the residual degrees of freedom (n - k), so the number
# of parameters k (including the intercept) is the row count minus it.
k = (data_frame.count() - model.summary.degreesOfFreedom)
# Rearranging AIC = 2k - 2*ln(L) gives ln(L) = (2k - AIC) / 2.
log_likelihood = ((2 * k) - aic) / 2
should suffice.
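As a sanity check, the value derived from the AIC should agree (up to small numerical differences, e.g. from the 1e-20 fudge factor above) with the hand-rolled UDF version:

hand_rolled = wrapper.log_likelihood()  # the UDF-based method above (hypothetical wrapper object)
print(log_likelihood, hand_rolled)      # the two values should be very close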
I'd previously used log-likelihood as a metric to compare how well the model did with different values of "number of iterations". There, the data was the same and so was the model; only the iteration count changed. AIC comes into its own when comparing two (or more) different models, since it penalises the number of parameters as well as rewarding fit.