Kibana is a TypeScript/JavaScript product for creating visuals of logs. OpenSearch Dashboards is the Apache-licensed fork of it. Kibana is great when you know what you are looking for. But what if you don't?
Example
I have a small Kafka cluster of three nodes using the Raft protocol. I send messages and then check that a consumer has read them all. This integration test passes every time and there are no ERRORs. However, every so often the test takes over 2 minutes when it should take about 20 seconds.
A good run produces about 4.5k lines of logs; a bad run about 20k. Twenty thousand lines is a lot to go through when you don't know what you're looking for.
I slightly adapted the code here to turn my logs into TF-IDF vectors and used Locality Sensitive Hashing to map them to a lower dimensional space. Now, we can visualise what's going on.
The good run looks like this:
[2023-07-04 14:13:10,089] INFO [TransactionCoordinator id=2] Node 1 disconnected. (org.apache.kafka.clients.NetworkClient)
[2023-07-04 14:13:10,089] WARN [TransactionCoordinator id=2] Connection to node 1 (localhost/127.0.0.1:9091) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
The bad run looks like this:
[2023-07-04 14:16:21,755] WARN [TransactionCoordinator id=2] Connection to node 1 (localhost/127.0.0.1:9091) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2023-07-04 14:16:21,805] INFO [TransactionCoordinator id=2] Node 1 disconnected. (org.apache.kafka.clients.NetworkClient)
[2023-07-04 14:16:21,805] WARN [TransactionCoordinator id=2] Connection to node 3 (localhost/127.0.0.1:9093) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2023-07-04 14:16:21,805] INFO [TransactionCoordinator id=2] Node 3 disconnected. (org.apache.kafka.clients.NetworkClient)
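For the curious, here is a minimal sketch of the idea (assuming scikit-learn and matplotlib; the file name is made up and this is not the exact code I adapted):

```python
# A sketch of the approach: each log line becomes a TF-IDF vector, then random
# projection (a standard LSH family for cosine similarity) maps the vectors
# into 2D so they can be plotted.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.random_projection import GaussianRandomProjection
import matplotlib.pyplot as plt

with open("kafka_integration_test.log") as f:          # hypothetical log file
    lines = [line.strip() for line in f if line.strip()]

tfidf = TfidfVectorizer().fit_transform(lines)          # one sparse vector per log line
projected = GaussianRandomProjection(n_components=2).fit_transform(tfidf.toarray())

plt.scatter(projected[:, 0], projected[:, 1], s=2)
plt.show()
```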
"ML Ops is a set of practices that aims to deploy and maintain machine learning models in production reliably and efficiently" according to Wikipedia. In particular, my job includes "reproducibility and diagnostics of models and predictions" (see the "Goals" section). Despite repeatability being the bedrock of good science, it's surprising how few data scientists appreciate the importance of a pipeline that can reproduce results on demand.
ML Ops is an especially interesting field because, unlike most areas of software and data engineering, the mapping is often not one-to-one. For instance, in a typical banking app, an input (purchase of a financial instrument) leads to just one output (the client now owns a financial instrument) modulo any cross-cutting concerns such as logging. But in ML models, this is not always the case. Including a socioeconomic feature from the data set may change the output from the model in subtle ways. Or it may not at all. In fact, how can you be sure that you really did include it? Are you sure there wasn't a bug?
There seems to be a dearth of good resources for ML Ops, so here are some observations I've made.
Ontologies
Ontologies seem like a really nice idea but I've never seen them work. This is partly due to paradoxes that derive from trying to be all things to all men. True life story: the discharge date for some hospitals was 01/01/1900 for about 100 patients. This led to cumulative lengths-of-stay being negative. "OK," says the ontology team. "It's only 100 patients out of 10 million, so let's remove them." But the other 99 columns for these 100 patients were fine. So, during reconciliation with external systems, the QA team had a devil of a job trying to find why their numbers did not add up. They had no knowledge of the missing 100 data points that were absolutely fine other than their discharge dates.
Governance
Hold weekly town halls that encourage questions and maximise the number of eyes on the data while the data cleaning takes place. The format of this is very much like Prime Minister's Question Time. In addition, a regularly updated blog and threads that all teams can read aid knowledge transfer. What you really don't want is the data engineers working in isolation, not communicating regularly with downstream clients like the data scientists and QAs.
Test data
Fake data can sometimes be a problem. Say I create some data but want to bias it: I make all people with a socioeconomic score less than X be in one class rather than the other and run a logistic regression model on it. All goes well. Then, one day, it is decided to one-hot encode the socioeconomic score to represent 'poor' and 'everybody else', and the threshold for this partition happens to be X. My tests start failing with "PerfectSeparationError: Perfect separation detected, results not available" and I don't immediately know why (this SO reply points out why this causes the model to blow up).
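A toy reproduction of the failure mode, not the project's code (assuming numpy and statsmodels):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
score = rng.uniform(0, 100, size=1_000)     # fake socioeconomic score
X_THRESHOLD = 50                            # the threshold used to bias the fake data
y = (score < X_THRESHOLD).astype(int)       # class is wholly determined by the score

# One-hot encoding "poor vs everybody else" with the *same* threshold makes the
# feature a perfect predictor of y, so the MLE does not exist.
poor = (score < X_THRESHOLD).astype(float)

# Depending on the statsmodels version, this raises PerfectSeparationError
# ("Perfect separation detected, results not available") or emits a warning.
sm.Logit(y, sm.add_constant(poor)).fit()
```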
Non-determinism
If any functions have non-deterministic results (for example, F.first when aggregating a groupBy in Spark) then you are immediately at odds with your QA team. For example, a health provider has demographic data including socioeconomic status, but this can change over time. They want to reduce patient events to just patients. This means a groupBy on the patient ID. But which socioeconomic indicator do we use in the aggregation?
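A sketch of the problem and one deterministic alternative (assuming PySpark; the column names are made up):

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
events = spark.createDataFrame(
    [("p1", "2020-01-01", "low"), ("p1", "2021-06-30", "high")],
    ["patient_id", "event_date", "socioeconomic_status"])

# Non-deterministic: F.first returns whichever row Spark happens to see first,
# which can change from run to run and so never quite matches the QA team's numbers.
per_patient = events.groupBy("patient_id").agg(
    F.first("socioeconomic_status").alias("socioeconomic_status"))

# Deterministic: agree a rule (say, the status at the most recent event) and order explicitly.
w = Window.partitionBy("patient_id").orderBy(F.col("event_date").desc())
latest = (events.withColumn("rn", F.row_number().over(w))
                .filter(F.col("rn") == 1)
                .drop("rn"))
```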
Plenty of Data Sets
When pipelining data, it's better to persist more data sets than fewer. This facilitates QA as otherwise it's hard to understand what functions acted on a row. The disadvantage is that a pipeline may take longer to run since we're constantly persisting files to disk. Still, the greater visibility you get is worth it.
There is no tool as far as I know to see the lineage of functions on a single row.
Repeatability
I recommend having non-production transforms that examine the model's output. For instance, I'm currently working on a workflow that looks for insights in health data. I have a transform that counts the number of statistically significant insights that also have a significant effect (how you calculate these thresholds is up to you). The point being: after every pull request, the pipeline is run and the number of insights is compared to the previous run. If it changes significantly, there's a reasonable chance something nasty was introduced to the code base.
[Figure: Healthy-looking (auto-generated) numbers of insights from my models]
Without a decent platform, it's hard to have repeatable builds (see here and here for horror stories). Now, I have my issues with Palantir's Foundry but at least it has repeatability built in. Under the covers, it uses plain old Git for its SCM capabilities. What's even nicer, models can be built on a 'data' branch that corresponds to the Git branch, so you won't overwrite the data created by another branch when you do a full build.
Once again, I'm looking at patient data waiting times. This time, it's the A&E department. I'm not so much interested in how long they wait but rather the binary condition of whether they waited more or less than four hours.
Logistic Regression with Spark
I used Spark's GeneralizedLinearRegression. Hyperparameter tuning did not produce much difference in output.
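For reference, the sort of call involved looks something like this (a sketch assuming PySpark; the column and variable names are illustrative, not my actual schema):

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import GeneralizedLinearRegression

# df and feature_cols are assumed to exist already
assembled = VectorAssembler(inputCols=feature_cols, outputCol="features").transform(df)
train_df, test_df = assembled.randomSplit([0.9, 0.1], seed=42)

glr = GeneralizedLinearRegression(family="binomial", link="logit",
                                  labelCol="waited_over_4_hours",
                                  featuresCol="features")
model = glr.fit(train_df)
predictions = model.transform(test_df)
model.summary.pValues          # per-coefficient p-values from the training summary
```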
Splitting the data 90/10 into train and test sets gave me the following output curves on the test set:
[Figure: Prediction where true value is 1]
The above shows the predictions for patients who we know really did wait more than 4 hours for treatment. Below are the predictions for those we know waited less:
[Figure: Prediction where true value is 0]
Clearly, the model has spotted something but the overlap between the two states is large. What should the threshold be to best differentiate one from the other? Fortunately, I'm not trying to predict but to infer.
You can evaluate how good the model is with some simple PySpark code:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# 'predictions' is the DataFrame of the model's output on the test set
evaluator = BinaryClassificationEvaluator(labelCol=dependent_col, rawPredictionCol=prediction_col)
auroc = evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderROC"})  # area under the ROC curve
auprc = evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderPR"})   # area under the precision-recall curve
Evaluating with Spark's BinaryClassificationEvaluator gave me an area under the ROC curve of 0.79. Hardly stellar but not terrible.
"A no-skill classifier will be a horizontal line on the plot with a precision that is proportional to the number of positive examples in the dataset. For a balanced dataset this will be 0.5... Precision-recall curves (PR curves) are recommended for highly skewed domains where ROC curves may provide an excessively optimistic view of the performance." [MachineLearningMastery]
Probabilities, Odds, and Odds-Ratios
"Odds are the probability of an event occurring divided by the probability of the event not occurring. An odds ratio is the odds of the event in one group ... divided by the odds in another group." [Making sense of odds and odds ratios]
"The [linear regression] model is specified in terms of K−1 log-odds or logit transformations (reflecting the constraint that the probabilities sum to one)... The choice of denominator is arbitrary in that the estimates are equivariant under this choice." [Elements of Statistical Learning]
K here is the number of groups. So, for the case where we're looking at true/false outputs, K=2 and the equations in ESL become the simple equation in the following section.
Model output
An example: "given estimated parameters and values for x1 and x2 , the model might predict y = 0.5, but the only meaningful values of y are 0 and 1. It is tempting to interpret a result like that as a probability; for example, we might say that a respondent with particular values of x1 and x2 has a 50% chance. But it is also possible for this model to predict y = 1.1 or y = −0.1, and those are not valid probabilities. Logistic regression avoids this problem by expressing predictions in terms of odds rather than probabilities." [Think Stats, 2nd Edition, Downey]
In a two-state output, Downey's equation would be:
y = log(odds) = log(p / (1-p)) = β0 + β1 x1 + ...
Odds (and odds ratios) are used when reporting effect size. "when X, the predictor is continuous, the odds ratio is constant across values of X. But probabilities aren’t" because of that non-linear relationship above between odds and p.
"It works exactly the same way as interest rates. I can tell you that an annual interest rate is 8%. So at the end of the year, you’ll earn $8 if you invested $100, or $40 if you invested $500. The rate stays constant, but the actual amount earned differs based on the amount invested. Odds ratios work the same." [TheAnalysisFactor] Since there is no bound on x1 etc, you could easily have a value greater than 1.0 or less than 0.0 which just doesn't make sense for probabilities.
Interpreting the coefficients
I have 36 features and the p-values of approximately 30 of them are less than 0.05. This seems suspiciously good, even if sometimes their corresponding coefficients are not particularly large. "If that difference between your coefficient and zero isn't interesting to you, don't feel obligated to interpret it substantively just because p < .05." [StackOverflow]
"How does one interpret a coefficient of 0.081 (Std. Error = 0.026)? ... Incorporating the standard error we get an approximate 95% confidence interval of exp(0.081 ± 2 × 0.026) = (1.03, 1.14).
"This summary includes Z scores for each of the coefficients in the model (coefficients divided by their standard errors)... A Z score greater than approximately 2 in absolute value is significant at the 5% level." [Elements of Statistical Learning]
Interpreting the intercept
If we take the example here [QuantifyingHealth] where we're looking at the relationship between tobacco consumption (x1) and heart disease (y) over a 10 year period, then the interpretation of the intercept's coefficient depends on several conditions.
For example, say that the intercept's coefficient is -1.93. Plugging -1.93 into the equation above (with no other terms) and rearranging leads us to believe that the probability of heart disease for a non-smoker (x1=0) is 0.13.
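As a quick check, rearranging the two-state equation above with only the intercept term:

log(p / (1-p)) = -1.93, so p = e^-1.93 / (1 + e^-1.93) ≈ 0.145 / 1.145 ≈ 0.13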
If the smoking consumption is standardized, x1=0 "corresponds to the mean lifetime usage of tobacco in kg, and the interpretation becomes: For an average consumer of tobacco, the probability of having a heart disease in the next 10 years is 0.13" [ibid].
If we're only interested in smokers not the general population, then x1≠0. "Let’s pick the maximum [623kg] as a reference and calculate the limit of how much smoking can affect the risk of heart disease... We get P = 0.22. The interpretation becomes: the maximum lifetime consumption of 623 kg of tobacco is associated with a 22% risk of having heart disease".
Interesting problem: we have 33 thousand 'areas' in the UK for the purposes of health. People would like to compare one health area with another that has similar demographics etc. We'd like to find these peers automatically using machine learning but nobody really knows what makes a neighbour in feature space an actual neighbour.
This is an unsupervised ML task. We cannot use the go-to algorithm KNN (k-nearest neighbours, where a class is elected given the k nearest neighbours as calculated with some arbitrary distance metric) as this is a supervised learning algorithm and here each vector is its own distinct class. That is, there is no class with more than one data point in it.
So, how do we measure how good our algorithm is? We might like to use hypothesis testing, where we compare the 10 nearest neighbours (k=10 for us) from our algorithm against 10 randomly chosen feature vectors (the null hypothesis). For both samples, we follow this process: take, say, 200 other randomly chosen vectors (synthetic data) and, for each feature of each of the 200, calculate whether it's within the min/max range of the sample (the test statistic). With this we can calculate whether we're getting results better than chance (the p-value). So we have fulfilled all the criteria of a hypothesis test from Downey's "There is still only one test" blog post.
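A sketch of that test in code (assuming numpy; `areas` is an illustrative name for the (N, F) feature matrix and `neighbour_idx` for the indices of the 10 peers our algorithm chose):

```python
import numpy as np

def within_range_fraction(sample, others):
    # Test statistic: the fraction of feature values in `others` that fall
    # inside the per-feature min/max range of `sample`.
    lo, hi = sample.min(axis=0), sample.max(axis=0)
    return ((others >= lo) & (others <= hi)).mean()

rng = np.random.default_rng(0)
others = areas[rng.choice(len(areas), size=200, replace=False)]   # synthetic comparison set

actual = within_range_fraction(areas[neighbour_idx], others)

# Null hypothesis: 10 areas chosen at random do just as well.
null = np.array([
    within_range_fraction(areas[rng.choice(len(areas), size=10, replace=False)], others)
    for _ in range(1_000)])

# A genuinely tight set of peers has narrower per-feature ranges than a random
# draw, so fewer of the 200 synthetic vectors land inside them; the p-value is
# the fraction of random draws that do at least as well.
p_value = (null <= actual).mean()
```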
Of course, we would jolly well hope that our results are better than chance but this introduces another problem: when every combination of feature engineering scores 100%, which is the best? We could take the difference between the test statistic for the null hypothesis sample and the actual sample. This is what we did in the end but it is one of any number of metrics.
Another metric we tested was the Pearson correlation statistic, measuring each feature vector in a sample against all the others. If the average score for our actual sample is greater than that of the null hypothesis sample, then we know we're on the right path. Unfortunately, this did not prove great. Thinking about it, that's not too surprising since the vectors are not random numbers but the demographics of (roughly) equally sized regions in the same country.
The Algorithm
Having established a test statistic, we then need to choose an algorithm and features.
Using the Freedman-Diaconis rule to generate optimal bucket sizes from which probabilities could be calculated, I then used the Jensen-Shannon distance to calculate nearest neighbours. However, our data was Zipfian, so almost everything fell into the same bucket. With some hacking, I resized the buckets so that there were many around the sharp peak and few at the edges. However, the score from using JS was still not great. I'm told this is because JS is better for finding outliers, which also jibes with my experience.
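For a single feature of two areas, the binning and divergence step might look like this (a sketch assuming numpy and scipy; `a` and `b` are illustrative arrays of values):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

edges = np.histogram_bin_edges(np.concatenate([a, b]), bins="fd")  # Freedman-Diaconis
p, _ = np.histogram(a, bins=edges)
q, _ = np.histogram(b, bins=edges)
distance = jensenshannon(p, q)   # JS distance; square it for the divergence
```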
Instead, the metric we used was plain Euclidean (L2) distance. Bearing in mind that Euclidean distance is not a good metric in high dimensions [SO], our test statistic looked good when we applied it to bucketed (!) and scaled data. What's more, it was computationally cheap (remember that comparing all nodes is an O(N^2) problem) and easy to explain to the client.
Palantir
For this work, I was asked to use Palantir technology, which was pretty easy since it seems to be a thin wrapper around Spark. Qualms aside about selling open-source technology at high prices (with some features actually crippled), here are some best practices I came to acquire:
Neural Net Gradient Optimization
Spark's implementation of artificial neural nets is a little sparse. For instance, the only activation function is the private SigmoidFunction. There are other ways of training large ANNs at scale: this interesting article describes data scientists using the Deep Java Library (DJL - not to be confused with DL4J) within a Spark framework to orchestrate PyTorch model training.
In theory, one could add one's own activation functions and extract the weights that Spark generated. I took a look at the code of MultilayerPerceptronClassifier and saw how, in each epoch, it broadcasts the weights to all the executors and then uses RDD.treeAggregate to compute the gradient per vector, per partition, before combining them all. In the case of the BFGS numerical optimization technique, the Breeze library is used and the call to treeAggregate is invoked as a callback from its BFGS implementation. Note that this is done on Spark's driver, where the loss and averaged gradient are calculated.
Here, "we cannot write down the actual mathematical formula for the function we’re optimizing. (The technical term for the function that is being optimized is response surface.)" [1]
The format of the broadcast weights confused me at first as they're a one dimensional vector when I was expecting matrices (see Neural Nets are just Matrices). On closer inspection I saw that this vector contained all the values in all the matrices for all the layers and was sliced and diced accordingly.
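A toy version of the packing and unpacking idea (numpy; the shapes are made up and Spark's real layout also interleaves the bias terms):

```python
import numpy as np

layer_shapes = [(4, 8), (8, 2)]                                      # two weight matrices
flat = np.arange(sum(r * c for r, c in layer_shapes), dtype=float)   # the flat "broadcast" vector

# Slice the flat vector back into one matrix per layer.
matrices, offset = [], 0
for rows, cols in layer_shapes:
    matrices.append(flat[offset:offset + rows * cols].reshape(rows, cols))
    offset += rows * cols
```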
Hyperparameter Optimization via Randomisation
One level up (a meta-optimization, if you like), we want to tune the parameters of the neural nets (choice of activation function, size and number of layers, etc). Spark only offers ParamGridBuilder out of the box. This is really just a simple class that creates a Cartesian product of the possible parameters. This is fed to CrossValidator.setEstimatorParamMaps as a simple Array so the cross validator can explore the parameter space by brute force.
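A minimal sketch of that brute-force search (assuming PySpark; the estimator, layer sizes and grid values are illustrative):

```python
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

mlp = MultilayerPerceptronClassifier(featuresCol="features", labelCol="label")

# ParamGridBuilder just builds the Cartesian product:
# 3 architectures x 2 block sizes = 6 models per fold.
grid = (ParamGridBuilder()
        .addGrid(mlp.layers, [[36, 10, 2], [36, 20, 2], [36, 20, 10, 2]])
        .addGrid(mlp.blockSize, [64, 128])
        .build())

cv = CrossValidator(estimator=mlp,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3)
# best_model = cv.fit(train_df).bestModel   # train_df assumed to exist
```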
There is a more clever way that involves a random search. "for any distribution over a sample space with a finite maximum, the maximum of 60 random observations lies within the top 5% of the true maximum, with 95% probability" [1]
It seems odd but it's true. Let's look at the probability of not being in the top 5%. And let's call this top 5% space, X. Then:
P(x ∈ X) = 0.05
so, the probability of a random point not being in the top 5% is:
P(¬x ∈ X) = 1 - 0.05
Now, given N points, the probability of them all not being in the top 5% is:
P(¬x_i ∈ X ∀i) = (1 - 0.05)^N
So, the probability that at least one point is in the top 5% is one minus this.
Say we want a 95% probability of being in the top 5%: solve 1 - (1 - 0.05)^N = 0.95 and you'll see N is slightly less than 60. QED.
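A two-line check of that claim:

```python
import math
print(math.log(0.05) / math.log(0.95))   # ~58.4, so 59 random draws suffice
```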
Exploration of Space
The state space of parameters can be explored for maxima or minima via various techniques. There are more intelligent ways of doing it, but one way is to use Monte Carlo techniques. Here in my GitHub account is a Python/Theano/PyMC3 example using Metropolis-Hastings to sample from the beautifully named Himmelblau function. This function is nice as we know analytically that it happens to have four local minima. But let's pretend we don't know that:
[Figure: The Himmelblau function (top) and finding its minima (bottom)]
[Figure: Finding two (equal) local minima]
[Figure: Local minima (3.0, 2.0) found in 60 steps]
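Stripped of PyMC3/Theano, the gist of the Metropolis-Hastings exploration shown above is something like the following sketch (assuming numpy; it is not the GitHub code itself):

```python
import numpy as np

def himmelblau(x, y):
    return (x ** 2 + y - 11) ** 2 + (x + y ** 2 - 7) ** 2

rng = np.random.default_rng(1)
current = np.array([0.0, 0.0])
temperature = 2.0                       # treat exp(-f/T) as the unnormalised target density

for _ in range(1_000):
    proposal = current + rng.normal(scale=0.5, size=2)
    # Downhill moves are always accepted; uphill moves with probability exp(-Δf/T).
    log_alpha = (himmelblau(*current) - himmelblau(*proposal)) / temperature
    if np.log(rng.random()) < log_alpha:
        current = proposal

print(current, himmelblau(*current))    # should end up near one of the four minima, e.g. (3, 2)
```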
[1] Alice Zheng on O'Reilly
[Figure: Typical synthetic data]
[Figure: The data projected onto the VAE's bottleneck]
[Figure: Spot the outlier]
The most important use of this ratio is in selecting a learning rate. As a rule of thumb, this ratio should be around 1:1000 = 0.001. On the log10 chart, this corresponds to a value of -3. Note that this is a rough guide only and may not be appropriate for all networks. It's often a good starting point, however.
If the ratio diverges significantly from this (for example, > -2 (i.e., 10^-2 = 0.01) or < -4 (i.e., 10^-4 = 0.0001)), your parameters may be too unstable to learn useful features, or may change too slowly to learn useful features. To change this ratio, adjust your learning rate (or sometimes, parameter initialization).
With a learning rate of 1e-2 using the TANH activation function, this chart was very jagged for me. Changing the learning rate (with Adam) to 1e-4 smoothed it:
[Figure: Learning rate 1e-4, activation function TANH]
[Figure: TANH; learning rate 1e-3]
[Figure: TANH; learning rate 1e-5]
[Figure: SOFTPLUS; learning rate = 1e-5]
[Figure: VAE's score with vectors of character bi-gram probabilities]
[Figure: VAE's score with 1-hot encoded characters]
[Figure: VAE score with 1-hot encoding, now using Bernoulli reconstruction distribution]
PhillHenry @PhillHenry Mar 22 16:25
Silly question but I assume a negative "Model Score vs. Iteration" in the "DL4J Training UI" is a bad sign...?

Alex Black @AlexDBlack 00:32
@PhillHenry as a general rule, yes, negative score is bad
it is typically caused by a mismatch between the data, loss function and output layer activation function
for example, something like using tanh activation function for classification instead of softmax

PhillHenry @PhillHenry 03:02
Are there any circumstances where it's OK? (BTW I'm trying to run a VAE).

Samuel Audet @saudet 03:34
IIRC, this happens with VAE when it's not able to fit the distribution well

Alex Black @AlexDBlack 04:14
VAE is an exception, to a degree
you can still have a mismatch between the reconstruction distribution and the actual data, and that will give you rubbish negative results (like incorrectly trying to fit gaussian data with bernoulli reconstruction distribution)
however, it is possible (but rare) for negative log likelihood to be negative on very good fit of the data - IIRC we exclude some normalization terms. if that's the case, you should see excellent reconstructions though
[Figure: ROC curve for VAE with LEAKYRELU, 1-hot encoding, Bernoulli reconstruction]
[Figure: VAE with 5 hidden layers]
[Figure: VAE with 5 hidden layers and the last layer has 2 units]
[Figure: VAE with 2 hidden units]
[Figure: KL Score Histogram with no normalisation]
| Class | KL Score |
|-------|----------|
| Good | 4 319 810.24 |
| Bad | 4 169 380.40 |
[Figure: KL Score Histogram with all vectors L1-normalised]
| Class | KL Score |
|-------|----------|
| Good | 97 285.45 |
| Bad | 139 889.62 |
[Figure: KL Score Histogram with the baseline L1-normalised; all other vectors unnormalised]
[Figure: KL Score Histogram with the baseline L2-normalised; all other vectors unnormalised]
| Class | KL Score |
|-------|----------|
| Good | 4 914 404.73 |
| Bad | 4 914 407.38 |
[Figure: KL Score Histogram with the baseline unnormalized; all other vectors L1-normalized]
| Class | KL Score |
|-------|----------|
| Good | 773 262.74 |
| Bad | 704 254.41 |

| Class | KL Score |
|-------|----------|
| Good | 94 506.53 |
| Bad | 83 559.17 |
[Figure: ROC for "Normalised Baselines, No Normalisation for Others" KL]