
Sunday, June 15, 2025

Making ML projects more robust

The software industry has made great strides in making its engineering process more robust. Despite being an adjacent discipline, data science is still in the dark ages.

You can unit test any code, so there is no excuse for not writing tests. However, a lot of data science tools are not ergonomic in this regard. Notebooks are great for EDA, less so for robust software practices.

So, although they are not a replacement for automated tests, here are some tips if you're stuck in the world of notebooks.

Observability

If your platform does not support it out of the box, add it manually. For instance, in one project there was one table that another five joined to at some point downstream in the pipeline. That first table was run several times with different parameters. This led to the potential problem that the six tables in total might be out of sync with each other. A solution was to add a run date to the first table and have the other five preserve this value when they are generated. The Python code that then uses this data asserts that the date is consistent across all tables.
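
A minimal sketch of that assertion in PySpark (the table and column names here are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

tables = ["base_table", "joined_1", "joined_2", "joined_3", "joined_4", "joined_5"]

run_dates = set()
for name in tables:
    df = spark.table(name)
    # collect the distinct run_date values carried through from the base table
    run_dates |= {row.run_date for row in df.select("run_date").distinct().collect()}

assert len(run_dates) == 1, f"Tables are out of sync: saw run_dates {run_dates}"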

Plot graphs of important features

In a bioinformatics project, the age of a patient was calculated in Python as the time between the patient's date of birth and the start of the observation period. This was reasonable. But several weeks later, the client asked for all of a patient's medical details to be used in the study. Consequently, the observation period - calculated in SQL - effectively became a patient's first contact with the medical services.

For most people, this happens in the first year of life. For them, their age became zero. This wasn't universally true: immigrants would have their first medical intervention later in life. So, somebody browsing the data might think it odd that there were so many infants in their sample but assume it was just a statistical fluctuation, as there were clearly 30 and 40 year olds in there too.

However, when you plotted the age after the SQL code change, the graph of the population's age looked like this:

A bug in calculating age

Without an end-to-end suite of tests, nobody thought of updating the age calculation logic.
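
Catching this class of bug can be as cheap as a histogram. A sketch, assuming a pandas DataFrame called patients with an age column:

import matplotlib.pyplot as plt

# plot the distribution of a feature you care about; a spike at zero is a red flag
patients["age"].plot.hist(bins=50)
plt.xlabel("age at start of observation period")
plt.title("Patient age distribution")
plt.show()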

Sampling 

The moment you add constraints to sampled data, you must beware of some statistical oddities. 

We wanted to assign a cutoff date to the negative cohort, randomly sampled from the dates of the positive cohort. Sounds straightforward, right? Well, then you have to add the constraint that a cutoff date only makes sense if it comes after a patient's date of birth. But this very reasonable constraint skews the data. The reason is that somebody born in 2000 can be given any cutoff date between 2000 and 2025, giving them an age at cutoff of 0 to 25 years. After sampling, we might give them a cutoff date of 2003 and consequently an age of 3.

A bug in sampling

However, somebody born in 2022 can have a maximum age of 3 after sampling. Consequently, we have a situation where the 25-year-old born in 2000 can contribute to the 3-year-old bucket after sampling, but the 3-year-old born in 2022 cannot contribute to the 25-year-old bucket. Hence, you get far more 3-year-olds than are really in the data.
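
A toy simulation makes the skew obvious. This sketch assumes birth years uniform between 2000 and 2025 and a cutoff date sampled uniformly between birth and 2025:

import numpy as np

rng = np.random.default_rng(0)
birth_year = rng.integers(2000, 2026, size=100_000)
# constraint: the cutoff must come after birth, so sample uniformly in [birth, 2025]
cutoff_year = rng.integers(birth_year, 2026)
age_at_cutoff = cutoff_year - birth_year

# young ages are heavily over-represented even though birth years were uniform
counts = np.bincount(age_at_cutoff)
print(counts[:5], "...", counts[-5:])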

Wednesday, August 30, 2023

Can we apply ML to logging?

Kibana is a TypeScript/JavaScript product for creating visuals of logs. OpenSearch Dashboards is the Apache-licensed fork of it. Kibana is great when you know what you are looking for. But what if you don't?

Example

I have a small Kafka cluster of three nodes using the Raft protocol. I send messages then check a consumer has read all the messages. This integration test passes every time. There are no ERRORs. However, every so often, this test takes over 2 minutes when it should take about 20 seconds.

The number of lines on a good run is 4.5k and on the bad run about 20k. Twenty thousand lines is a lot to go through when you don't know what you're looking for.

I slightly adapted the code here to turn my logs into TF-IDF vectors and used Locality Sensitive Hashing to map them to a lower dimensional space. Now, we can visualise what's going on. 

The good run looks like this:




Note that there are two dominant lines that map to:

[2023-07-04 14:13:10,089] INFO [TransactionCoordinator id=2] Node 1 disconnected. (org.apache.kafka.clients.NetworkClient)
[2023-07-04 14:13:10,089] WARN [TransactionCoordinator id=2] Connection to node 1 (localhost/127.0.0.1:9091) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)

repeated over and over for about 10 seconds.

The bad run looks like this:



Here, the dominant lines in the diagram are from:

[2023-07-04 14:16:21,755] WARN [TransactionCoordinator id=2] Connection to node 1 (localhost/127.0.0.1:9091) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2023-07-04 14:16:21,805] INFO [TransactionCoordinator id=2] Node 1 disconnected. (org.apache.kafka.clients.NetworkClient)
[2023-07-04 14:16:21,805] WARN [TransactionCoordinator id=2] Connection to node 3 (localhost/127.0.0.1:9093) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2023-07-04 14:16:21,805] INFO [TransactionCoordinator id=2] Node 3 disconnected. (org.apache.kafka.clients.NetworkClient)

again being repeated but this time, it lasts for about 2 minutes.

[The code in Ethen Lui's GitHub is really quite clever. Rather than using MinHashing, he's projecting the feature vectors against some randomly generated vectors and making a bit map from it. This can be turned into a single integer which represents the feature's bucket. Note that the number of vectors does not really change the dimensionality of the space but it does change how consistent different runs are - more vectors leads to greater repeatability]
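
As a rough illustration of that idea (this is my own sketch, not Ethen Lui's code, and the log path is hypothetical): build TF-IDF vectors with scikit-learn, project them onto a handful of random hyperplanes and pack the sign bits into a single bucket id.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

with open("kafka.log") as f:                       # hypothetical log file
    lines = [line.strip() for line in f if line.strip()]

tfidf = TfidfVectorizer().fit_transform(lines)     # sparse matrix (n_lines, n_terms)

k = 16                                             # number of random hyperplanes
rng = np.random.default_rng(42)
planes = rng.normal(size=(tfidf.shape[1], k))

projections = np.asarray(tfidf @ planes)           # project each line onto the planes
bits = (projections > 0).astype(np.int64)          # one sign bit per plane
buckets = bits @ (1 << np.arange(k))               # pack the bits into one integer bucket id

# near-duplicate lines hash to the same bucket; the dominant buckets are the
# lines that are drowning out everything else
values, counts = np.unique(buckets, return_counts=True)
print(sorted(zip(counts, values), reverse=True)[:5])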

Still at something of a loss, I checked out Bitnami's Kafka instances (here), changed the logging in bitnami/kafka/3.5/debian-11/rootfs/opt/bitnami/scripts/kafka/postunpack.sh by adding the line:

replace_in_file "${KAFKA_CONF_DIR}/log4j.properties" "INFO" "DEBUG"

and built the Docker image again. Now it gives me DEBUG statements.

Fix

The problem of non-determinism is still foxing me but the solution became clear with all these mentions of localhost. We need the client to communicate with the cluster on localhost because the client is unaware that the Kafka instances are hosted in Docker. However, each broker does need to know it's talking to another Docker container as the ports of its peers are not available within its own sandbox. 

The solution was to use slightly different values for the listeners and the advertised listeners (KAFKA_CFG_LISTENERS vs KAFKA_CFG_ADVERTISED_LISTENERS; note that Bitnami expects environment variables prefixed with KAFKA_CFG_ and with periods replaced by underscores, which it then converts into a Kafka-friendly server.properties file).

The listeners were of the form OUTSIDE://:9111 while the advertised listeners were of the form OUTSIDE://localhost:9111. The label OUTSIDE is apparently arbitrary. It's just used as a reference, say in listener.security.protocol.map (in Kafka-speak; munge it with the necessary Bitnami mappings to make it appear in server.properties) where you'll see something like OUTSIDE:PLAINTEXT.
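
Purely as an illustration of the naming convention and the values that worked for me (this is not Bitnami's actual code):

def to_property(env_var):
    # KAFKA_CFG_ADVERTISED_LISTENERS -> advertised.listeners
    return env_var[len("KAFKA_CFG_"):].lower().replace("_", ".")

env = {
    "KAFKA_CFG_LISTENERS":            "OUTSIDE://:9111",           # what the broker binds to
    "KAFKA_CFG_ADVERTISED_LISTENERS": "OUTSIDE://localhost:9111",  # what clients are told
}

for key, value in env.items():
    print(f"{to_property(key)}={value}")
# listeners=OUTSIDE://:9111
# advertised.listeners=OUTSIDE://localhost:9111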

Conclusion

Although I've fixed the Kafka issue I was facing, applying ML to the logs was only a partial help. I still need to understand the Kafka Raft code better before it can truly be of use.

Wednesday, February 23, 2022

ML Ops

"ML Ops is a set of practices that aims to deploy and maintain machine learning models in production reliably and efficiently" according to Wikipedia. In particular, my job includes "reproducibility and diagnostics of models and predictions" (see the "Goals" section). Despite repeatability being the bedrock of good science, it's surprising how few data scientists appreciate the importance of a pipeline that can reproduce results on demand.

ML Ops is an especially interesting field because, unlike most areas of software and data engineering, the mapping is often not one-to-one. For instance, in a typical banking app, an input (purchase of a financial instrument) leads to just one output (the client now owns a financial instrument) modulo any cross-cutting concerns such as logging. But in ML models, this is not always the case. Including a socioeconomic feature from the data set may change the output from the model in subtle ways. Or it may not at all. In fact, how can you be sure that you really did include it? Are you sure there wasn't a bug?

There seems to be a dearth of good resources for ML Ops, so here are some observations I've made.

Ontologies

Ontologies seem like a really nice idea but I've never seen them work. This is partly due to paradoxes that derive from trying to be all things to all men. True life story: the discharge date for some hospitals was 01/01/1900 for about 100 patients. This led to cumulative lengths-of-stay being negative. "OK," says the ontology team. "It's only 100 patients out of 10 million, so let's remove them." But the other 99 columns for these 100 patients were fine. So, during reconciliation with external systems, the QA team had a devil of a job trying to find out why their numbers did not add up. They had no knowledge of the missing 100 data points that were absolutely fine other than their discharge dates.

Governance

Hold weekly town halls that encourage questions and maximise the pairs of eyes looking at the data while the data cleaning takes place. The format of this is very much like Prime Minister's Question Time. In addition, a regularly updated blog and threads that all teams can read aid knowledge transfer. What you really don't want is the data engineers working in isolation and not communicating regularly with downstream clients like the data scientists and QAs.

Test data

Fake data can sometimes be a problem. Say I create some data but want to bias it. I make all people with a socioeconomic score less than X be in one class rather than the other and run a logistic regression model on it. All goes well. Then, one day, it is decided to one-hot encode the socioeconomic score to represent the poor and everybody else. The threshold for this partition happens to be X. Then my tests start failing with "PerfectSeparationError: Perfect separation detected, results not available" and I don't know immediately why (this SO reply points out that perfect separation causes the model to blow up).
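
A small reproduction of that failure mode, assuming statsmodels (depending on the version it raises PerfectSeparationError or just warns and returns huge coefficients):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
score = rng.uniform(0, 100, size=500)
y = (score < 40).astype(int)        # class entirely determined by the score

poor = (score < 40).astype(int)     # one-hot threshold that happens to match the class boundary
X = sm.add_constant(poor)

# the classes are now perfectly separable by the single feature, so the fit blows up
sm.Logit(y, X).fit()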

Non-determinism

If there are any functions that have non-deterministic results (for example, F.first when you aggregate a groupBy in Spark) then you are immediately at odds with your QA team. For example, a health provider has demographic data including socioeconomic status. But this can change over time. They want to reduce patient events to just patients. This means a groupBy on the patient ID. But which socioeconomic indicator do we use in the aggregation?
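
One way out is to make the aggregation explicitly deterministic, for example by always taking the value from the latest event. A sketch, assuming an events DataFrame with patient_id, event_ts and socioeconomic columns:

from pyspark.sql import functions as F, Window

# rank each patient's events by timestamp, newest first
w = Window.partitionBy("patient_id").orderBy(F.col("event_ts").desc())

latest = (events
          .withColumn("rn", F.row_number().over(w))
          .filter(F.col("rn") == 1)              # keep only the most recent event
          .select("patient_id", "socioeconomic"))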

Plenty of Data Sets

When pipelining data, it's better to persist more data sets than fewer. This facilitates QA as otherwise it's hard to understand what functions acted on a row. The disadvantage is that a pipeline may take longer to run since we're constantly persisting files to disk. Still, the greater visibility you get is worth it.

There is no tool as far as I know to see the lineage of functions on a single row.

Repeatability

I recommend having non-production transforms that examine the model's output. For instance, I'm currently working on a workflow that looks for insights in health data. I have a transform that counts the number of statistically significant insights that also have a significant effect (how you calculate these thresholds is up to you). The point being, after every pull request, the pipeline is run and the number of insights is compared to the previous run. If it changes significantly, there's a reasonable chance something nasty was introduced to the code base.

Healthy looking (auto-generated) numbers of insights from my models
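
The check itself can be very small. A sketch, where the file name, the 10% tolerance and the count_significant_insights helper are all my own inventions:

import json

with open("previous_run_metrics.json") as f:
    previous = json.load(f)["significant_insights"]

current = count_significant_insights()   # hypothetical function over this run's model output

assert abs(current - previous) <= 0.1 * previous, (
    f"Insight count moved from {previous} to {current}; did the last PR break something?"
)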

Without a decent platform, it's hard to have repeatable builds (see here and here for horror stories). Now, I have my issues with Palantir's Foundry but at least it has repeatability built in. Under the covers, it uses plain, old Git for its SCM capabilities. What's even nicer, models can be built on a 'data' branch that corresponds to the Git branch, so you won't overwrite the data created by another branch when you do a full build. 

Wednesday, November 10, 2021

Logistic Regression in Action

Once again, I'm looking at patient data waiting times. This time, it's the A&E department. I'm not so much interested in how long they wait but rather the binary condition of whether they waited more or less than four hours.

Logistic Regression with Spark 

I used Spark's GeneralizedLinearRegression. Hyperparameter tuning did not produce much difference in output.
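
For reference, the setup looks roughly like this (the column names and maxIter are placeholders of mine); logistic regression is just a GLM with a binomial family and a logit link:

from pyspark.ml.regression import GeneralizedLinearRegression

glr = GeneralizedLinearRegression(family="binomial",
                                  link="logit",
                                  featuresCol="features",
                                  labelCol="waited_over_4_hours",
                                  maxIter=100)
model = glr.fit(train_df)        # train_df: the 90% split, assembled into a vector column
print(model.summary.pValues)     # coefficients' p-values, used in the inference below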

Splitting the data 90/10 for train and test gave me the following output on the test curves:

Prediction where true value is 1

The above shows the predictions for patients that we know really did wait more than 4 hours for treatment. The below shows predictions where we really know they waited less:

Prediction where true value is 0

Clearly, the model has spotted something but the overlap between the two states is large. What should the threshold be to best differentiate one from the other? Fortunately, I'm not trying to predict but to infer.

You can evaluate how good the model is with some simple PySpark code:

from pyspark.ml.evaluation import BinaryClassificationEvaluator

# `predictions` is the DataFrame produced by the fitted model's transform();
# dependent_col is the true label column, prediction_col the raw prediction.
evaluator = BinaryClassificationEvaluator(labelCol=dependent_col,
                                          rawPredictionCol=prediction_col)
auroc = evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderROC"})
auprc = evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderPR"})

Evaluating with Spark's BinaryClassificationEvaluator gave me an area under the ROC curve of 0.79. Hardly stellar but not terrible.

"A no-skill classifier will be a horizontal line on the plot with a precision that is proportional to the number of positive examples in the dataset. For a balanced dataset this will be 0.5... Precision-recall curves (PR curves) are recommended for highly skewed domains where ROC curves may provide an excessively optimistic view of the performance." [MachineLearningMastery]

Probabilities, Odds, and Odds-Ratios

"Odds are the probability of an event occurring divided by the probability of the event not occurring. An odds ratio is the odds of the event in one group ... divided by the odds in another group." [Making sense of odds and odds ratios]

"The [linear regression] model is specified in terms of K−1 log-odds or logit transformations (reflecting the constraint that the probabilities sum to one)... The choice of denominator is arbitrary in that the estimates are equivariant under this choice." [Elements of Statistical Learning]

K here is the number of groups. So, for the case where we're looking at true/false outputs, K=2 and the equations in ESL become the simple equation in the following section.

Model output

An example: "given estimated parameters and values for x1 and x2 , the model might predict y = 0.5, but the only meaningful values of y are 0 and 1.  It is tempting to interpret a result like that as a probability; for example, we might say that a respondent with particular values of x1 and x2 has a 50% chance. But it is also possible for this model to predict y = 1.1 or y = −0.1, and those are not valid probabilities. Logistic regression avoids this problem by expressing predictions in terms of odds rather than probabilities." [Think Stats, 2nd Edition, Downey]

In a two-state output, Downey's equation would be:

y = log(odds) = log(p / (1-p)) =  β0 + β1 x1 + ...

Odds(-ratios) are used when reporting effect size. "When X, the predictor, is continuous, the odds ratio is constant across values of X. But probabilities aren't" because of the non-linear relationship above between odds and p.

"It works exactly the same way as interest rates.  I can tell you that an annual interest rate is 8%.  So at the end of the year, you’ll earn $8 if you invested $100, or $40 if you invested $500.  The rate stays constant, but the actual amount earned differs based on the amount invested. Odds ratios work the same." [TheAnalysisFactor] Since there is no bound on x1 etc, you could easily have a value greater than 1.0 or less than 0.0 which just doesn't make sense for probabilities.

Interpreting the coefficients

I have 36 features and the p-values of approximately 30 of them are all less than 0.05. This seems suspiciously good even if sometimes their corresponding coefficients are not particularly significant. "If that difference between your coefficient and zero isn't interesting to you, don't feel obligated to interpret it substantively just because p < .05." [StackOverflow]

"How does one interpret a coefficient of 0.081 (Std. Error = 0.026)? ... Incorporating the standard error we get an approximate 95% confidence interval of exp(0.081 ± 2 × 0.026) = (1.03, 1.14).

"This summary includes Z scores for each of the coefficients in the model (coefficients divided by their standard errors)... A Z score greater than approximately 2 in absolute value is significant at the 5% level." [Elements of Statistical Learning]

Interpreting the intercept

If we take the example here [QuantifyingHealth] where we're looking at the relationship between tobacco consumption (x1) and heart disease (y) over a 10 year period, then the interpretation of the intercept's coefficient depends on several conditions. 

For example, say that the intercept's coefficient is -1.93. This leads us to believe the probability of heart disease for a non-smoker (x1=0) is 0.13 by plugging -1.93 in the above equation then rearranging.

If the smoking consumption is standardized, x1=0 "corresponds to the mean lifetime usage of tobacco in kg, and the interpretation becomes: For an average consumer of tobacco, the probability of having a heart disease in the next 10 years is 0.13" [ibid].

If we're only interested in smokers not the general population, then x1≠0. "Let’s pick the maximum [623kg] as a reference and calculate the limit of how much smoking can affect the risk of heart disease... We get P = 0.22. The interpretation becomes: the maximum lifetime consumption of 623 kg of tobacco is associated with a 22% risk of having heart disease".
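
Checking the arithmetic for the baseline case is a one-liner:

import math

def probability_from_logit(log_odds):
    # invert y = log(p / (1 - p))
    return 1.0 / (1.0 + math.exp(-log_odds))

print(probability_from_logit(-1.93))   # ~0.127, i.e. the 0.13 quoted above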

Friday, March 26, 2021

Unsupervised nearest neighbours

Interesting problem: we have 33 thousand 'areas' in the UK for the purposes of health. People would like to compare one health area with another that has similar demographics etc. We'd like to find these peers automatically using machine learning but nobody really knows what makes a neighbour in feature space an actual neighbour. 

This is an unsupervised ML task. We cannot use the go-to nearest-neighbour classifier, KNN (where a class is elected given the k nearest neighbours as calculated with some arbitrary distance metric), as that is a supervised learning algorithm and here each vector is its own distinct class. That is, there is no class with more than one data point in it.

So, how do we measure how good our algorithm is? We might like to use hypothesis testing where we compare the 10 nearest neighbours (k=10 for us) from our algorithm against 10 randomly chosen feature vectors (the null hypothesis). For both of these samples, we follow this process: take, say, 200 other randomly chosen vectors (synthetic data) and, for each feature of each of the 200, calculate whether it's within the min/max range of the sample (the test statistic). With this we can calculate whether we're getting results better than chance (the p-value). So we have fulfilled all of Downey's criteria for a hypothesis test in his "There is still only one test" blog post.
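
A sketch of that test statistic (the data here is random noise standing in for the real feature vectors, and the "neighbours" slice is a placeholder for whatever the algorithm returns):

import numpy as np

def range_coverage(sample, synthetic):
    # fraction of synthetic features that fall inside the sample's min/max range
    lo, hi = sample.min(axis=0), sample.max(axis=0)
    inside = (synthetic >= lo) & (synthetic <= hi)
    return inside.mean()

rng = np.random.default_rng(0)
areas = rng.normal(size=(33_000, 20))                       # stand-in for the real feature vectors
synthetic = areas[rng.choice(len(areas), 200, replace=False)]

neighbours = areas[:10]                                     # stand-in for our algorithm's 10 peers
random_sample = areas[rng.choice(len(areas), 10, replace=False)]

print("ours  :", range_coverage(neighbours, synthetic))
print("random:", range_coverage(random_sample, synthetic))  # the null hypothesis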

Of course, we would jolly well hope that our results are better than chance, but this introduces another problem: when every combination of feature engineering scores 100%, which is the best? We could take the difference between the test statistic for the null hypothesis sample and that for the actual sample. This is what we did in the end, but it is one of any number of possible metrics.

Another metric we tested was the Pearson correlation between each feature vector in a sample and all the others. If the average score for our actual sample is greater than that for the null hypothesis then we know we're on the right path. Unfortunately, this proved to be not great. Thinking about it, it's not too surprising, since the vectors are not random numbers but the demographics of (roughly) equal-sized regions in the same country.

The Algorithm

Having established a test statistic, we then need to choose an algorithm and features.

Using the Freedman-Diaconis rule to generate optimal bucket sizes from which probabilities could be calculated, I then used the Jensen-Shannon distance to calculate nearest neighbours. However, our data was Zipfian, so almost everything fell into the same bucket. With some hacking, I resized the buckets so that there were many around the sharp peak and few at the edges. However, the score from using JS was still not great. I'm told that's because JS is better for finding outliers, which also jibes with my experience.

Instead, the algorithm we used was plain Euclidean (L2) distance. Bearing in mind that Euclidean distance is not a good metric in high dimensions [SO], our test statistic looked good when we applied it to bucketed (!) and scaled data. What's more, it was computationally cheap (remember that comparing all nodes is an O(N^2) problem) and easy to explain to the client.
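
A sketch of the final approach with scikit-learn (the feature matrix here is a random stand-in for the real data):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

X = np.random.default_rng(0).normal(size=(33_000, 20))    # stand-in for the real features
X_scaled = StandardScaler().fit_transform(X)               # scale (and, in our case, bucket) first

nn = NearestNeighbors(n_neighbors=11, metric="euclidean").fit(X_scaled)
distances, indices = nn.kneighbors(X_scaled)

# column 0 is the area itself, columns 1..10 are its peers
peers = indices[:, 1:]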

Palantir

For this work, I was asked to use Palantir technology, which was pretty easy since it seems to be a thin wrapper around Spark. Qualms aside about selling open-source technology at high prices (with some features actually crippled), here are some best practices I came to acquire:

  • Don't have multiple tabs open on the codebase. Apparently, you can download the code into your favourite IDE but you won't have the data. So, at some point, you'll be coding in the browser. If you have multiple tabs open, Palantir gets confused and overwrites versions. Not a great user experience.
  • Create a pyspark file for each combination in the pipeline that is only a couple of lines long which delegates to shared code. This way, it's easier to explore lineage with the Monocle tool. This is actually quite a nice feature of Palantir giving you a graphical representation of your pipelines.


Friday, January 29, 2021

Spark Numerical Optimization

Neural Net Gradient Optimization

Spark's implementation of artificial neural nets is a little sparse. For instance, the only activation function is the private SigmoidFunction. It seems like there are other ways of training large ANNs at scale. This is an interesting article where data scientists used Deep Java Library (DJL - not to be confused with DL4J) in a Spark framework to orchestrate PyTorch model training.

In theory, one could add one's own activation functions and extract the weights that Spark generated. I took a look at the code of MultilayerPerceptronClassifier and saw how, in each epoch, it broadcasts the weights to all the executors and then uses RDD.treeAggregate to compute the gradient per vector, per partition, and then combine them all. In the case of the BFGS numerical optimization technique, the Breeze library is used and the call to treeAggregate is invoked as a callback from its BFGS implementation. Note that this is done on Spark's driver, and the loss and averaged gradient are calculated there.

Here, "we cannot write down the actual mathematical formula for the function we’re optimizing. (The technical term for the function that is being optimized is response surface.)" [1]

The format of the broadcast weights confused me at first as they're a one dimensional vector when I was expecting matrices (see Neural Nets are just Matrices). On closer inspection I saw that this vector contained all the values in all the matrices for all the layers and was sliced and diced accordingly.

Hyperparameter Optimization via Randomisation

One level up, a meta-optimization if you like, we want to tune the parameters for the neural nets (choice of activation function, size and cardinality of layers etc). Spark only offers ParamGridBuilder out of the box. This is really just a simple class that creates a Cartesian product of possible parameters. This is fed to CrossValidator.setEstimatorParamMaps as a simple Array so the cross validator can explore a parameter space by brute force.

There is a more clever way that involves a random search. "for any distribution over a sample space with a finite maximum, the maximum of 60 random observations lies within the top 5% of the true maximum, with 95% probability" [1]

It seems odd but it's true. Let's look at the probability of not being in the top 5%. And let's call this top 5% space, X. Then:

P(x ∈ X) = 0.05

so, the probability of a random point not being in the top 5% is:

P(x ∉ X) = 1 - 0.05

Now, given N points, the probability of them all not being in the top 5% is:

P(xᵢ ∉ X ∀ i) = (1 - 0.05)^N

So, the probability being in the top 5% is one minus this. 

Say, we want 95% probability we're in the top 5%, solve this equation and you'll see N is slightly less than 60. QED.
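
The arithmetic, for the sceptical:

import math

# solve (1 - 0.05)^N <= 0.05 for N
print(math.log(0.05) / math.log(0.95))   # ~58.4, so 60 samples comfortably suffice
print(1 - 0.95 ** 60)                    # ~0.954, i.e. >95% chance of landing in the top 5%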

Exploration of Space

The state space of parameters can be explored for maxima or minima via various techniques. There are more intelligent ways of doing it, but one way is to use Monte Carlo techniques. Here in my GitHub account is a Python/Theano/PyMC3 example using Metropolis-Hastings to sample from the beautifully named Himmelblau function. This function is nice as we know analytically that it has four local minima. But let's pretend we don't know that:

The Himmelblau function (top) and finding its minima (bottom)

The path to explore the space starts in a random place but quickly finds one of the local minima.

We might be lucky and the algorithm (non-deterministically) jumps out of its local minima and finds another:
Finding two (equal) local minima.

But Metropolis-Hastings generally needs a lot of samples to find all the local minima (in the diagram above, it took 30k and we only found two of the four). But it can at least find some local minimum quite easily, in as few as 60 steps:

Local minima (3.0, 2.0) found in 60 steps

The MH algorithm adapts with each step whereas the Spark cross validation API expects the parameter co-ordinates to be fed to it a priori. If Spark is to have a more intelligent search space agent, it won't be via this class.
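
For what it's worth, the idea doesn't need PyMC3 at all; a hand-rolled random-walk Metropolis sampler over exp(-f) is only a few lines (this is a sketch, not the code in the repo):

import numpy as np

def himmelblau(p):
    x, y = p
    return (x**2 + y - 11)**2 + (x + y**2 - 7)**2

rng = np.random.default_rng(0)
current = rng.uniform(-5, 5, size=2)          # start somewhere random
path = [current]

for _ in range(60):                           # even 60 steps usually finds a local minimum
    proposal = current + rng.normal(scale=0.5, size=2)
    # accept with probability min(1, exp(f(current) - f(proposal)))
    if np.log(rng.uniform()) < himmelblau(current) - himmelblau(proposal):
        current = proposal
    path.append(current)

print(path[-1], himmelblau(path[-1]))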

[1] Alice Zheng on O'Reilly

Friday, January 15, 2021

Neural Nets and Anomaly Detection pt 1


More playing with anomaly detection, this time with Keras. All the data is synthetic and can be generated from the code on my GitHub here.

The Data

Our data looks like this:

Typical synthetic data

The top row shows noisy data with a clear global periodicity. The bottom row shows noisy data with each column having a periodicity independent of the other columns.

Can a Variational Auto Encoder tell the difference?

The Model

The code for our neural net was adapted from this StackOverflow answer here. Very quickly (roughly 10 epochs), the VAE was able to differentiate which group a given square was in with an accuracy of about 90%.

The data projected onto the VAE's bottleneck

Note that I am not going to tune the VAE to get the last drop of accuracy. I'm happy with about 90%.

Note that the trick with VAEs is that the data with which we train it is also the data that we expect. This is simply because we're asking the neural net to reproduce what it's been given. It won't be exact at all since we're deliberately compressing the data through a bottleneck. This is the very definition of an auto encoder. The 'variational' term comes from the fact that the neural net learns the probabilities of the distribution representing the data [Quora].

The Results

The trouble with VAEs is that, although you might be able to print a pretty graph of the projected data, you might need some algorithm to differentiate the groups for you. I'm going to use scikit-learn to run a KMeans algorithm since I know there are exactly 2 groups (that is, k=2 here).

This is fine in this example where I have two groups of 256 elements. What if we're faced with imbalanced data and are looking for outliers? KMeans does not do so well with that. It doggedly tries to find two clusters of roughly equal size giving an accuracy of about 50% - the monkey score.

Spot the outlier

To this end, we can use the DBSCAN algorithm, again from scikit-learn. This finds my outlier but does take a bit of tuning.
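
A sketch of both steps, assuming latent is the 2-d bottleneck output (the eps and min_samples values are just starting points):

from sklearn.cluster import KMeans, DBSCAN

# fine when the two groups are roughly balanced
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(latent)

# better when hunting for outliers: points labelled -1 could not be assigned to any cluster
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(latent)
outliers = (dbscan_labels == -1)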

Still playing...


Friday, March 29, 2019

Hands on tuning a neural net


Introduction

Once again, I am trying to find what a typical domain name is using machine learning. For this, I use one million domains that Cisco helpfully have made available here. Once I have this, I can then use known malicious domains (provided by Bambenek Consulting here) to see how they deviate.

The data

I encode my domains first by creating a histogram of character pairs and turning this into probabilities. Now, each data point is a sparse vector of length 1444 (38 x 38 legitimate characters used in domain names).
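
A sketch of that encoding (the 38-character alphabet below is my assumption about which characters count as legitimate):

import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-."      # 38 symbols
INDEX = {c: i for i, c in enumerate(ALPHABET)}

def encode(domain):
    # histogram of character bigrams, normalised to probabilities
    v = np.zeros(len(ALPHABET) ** 2)                     # 38 x 38 = 1444
    for a, b in zip(domain, domain[1:]):
        if a in INDEX and b in INDEX:
            v[INDEX[a] * len(ALPHABET) + INDEX[b]] += 1
    total = v.sum()
    return v / total if total else v

print(encode("javaagile").nonzero()[0].size)             # only a handful of non-zero bigrams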

Now, I fire up a Variational Autoencoder written using the DL4J framework.

Rig


I have an old GeForce GTX 650 GPU that has only 1gb of memory. "To be blunt, 1GB is really tiny by today's standards, it's not really enough for deep learning especially if you're running a display on it at the same time (OS will use some of the memory)" Alex Black, DL4J Gitter room (Jan 29 00:46 2019)

So, I ran the neural net on the CPU.

I run my code with:

export D=/media/sdb8/Data/CiscoUmbrella/MyData/Real/Results/`date +%d%m%y%H%M` ; mkdir -p $D ; export MAVEN_OPTS="-Xmx32g" ;  taskset 0xFFFCE mvn clean install -DskipTests exec:java -Dexec.mainClass=XXX  -Dexec.args="YYY" | tee $D/mvn.log ; shutter --window=.*firefox.* -o $D/overview.png -e

Note taskset which uses all cores but one (15 out of my 16) so the OS is not overwhelmed.

Also, note shutter which takes a picture of the DL4J console that I am running in Firefox.

Learning Rate

On the "Update: Parameter Ratio Chart" tab, the mean magnitude means "the average of the absolute value of the parameters or updates at the current time step". From the DeepLearning4J visualisation guide:
The most important use of this ratio is in selecting a learning rate. As a rule of thumb: this ratio should be around 1:1000 = 0.001. On the log10 chart, this corresponds to a value of -3. Note that is a rough guide only, and may not be appropriate for all networks. It’s often a good starting point, however. 
If the ratio diverges significantly from this (for example, > -2 (i.e., 10^-2 = 0.01) or < -4 (i.e., 10^-4 = 0.0001)), your parameters may be too unstable to learn useful features, or may change too slowly to learn useful features. To change this ratio, adjust your learning rate (or sometimes, parameter initialization). 
With a learning rate of 1e-2 and the TANH activation function, this chart was very jagged for me. Changing the learning rate (with Adam) to 1e-4 smoothed it out:
Learning rate 1e-4, activation functions TANH
The learning rate seems to be quite sensitive. With a value of 1e-3, performance is already degrading. Note the Parameter Ratios are touching -2 and above and becoming volatile:
TANH; learning rate 1e-3
The same is true of 1e-5 where this time the parameters are drifting below -4:

TANH; learning rate 1e-5
So, it seems at least for TANH that a learning rate of 1e-4 is the sweet spot.

But with LEAKYRELU, Parameter Ratios started touching -4 and then we started seeing NaNs again. Similarly, for SOFTPLUS, even a learning rate of 1e-5 wasn't enough to save us from NaNs:
SOFTPLUS; learning rate = 1e-5
This came as something of a surprise as LEAKYRELU is often said to be the default activation function.

What's more, my score sucked:

VAE's score with vectors of character bi-gram probabilities
We'll return to the choice of activation function later. In the meantime...

The Data (again)

Maybe representing my data as vector probabilities was not a great idea. So, let's try 1-hot encoding the characters:

VAE's score with 1-hot encoded characters
Ah, that's better (if far from optimal). It's also slower to train as the vectors are just as sparse but now have a size of about 4000.

Reconstruction Distribution

But wait - if the data has changed in its nature, maybe we need to reassess the neural net. Well, the reconstruction distribution was a Gaussian, which might have made sense when we were dealing with probability vectors. However, we're now 1-hot encoding, so let's change it to a Bernoulli since "The Bernoulli distribution is binary, so it assumes that observations may only have two possible outcomes." (StackOverflow). Well, that sounds reasonable, so let's see:

VAE score with 1-hot encoding, now using Bernoulli reconstruction distribution.
Negative scores? Time to ask the DL4J Gitter group:
PhillHenry @PhillHenry Mar 22 16:25
Silly question but I assume a negative "Model Score vs. Iteration" in the "DL4J Training UI" is a bad sign...?

Alex Black @AlexDBlack 00:32
@PhillHenry as a general rule, yes, negative score is bad. It is typically caused by a mismatch between the data, loss function and output layer activation function - for example, something like using tanh activation function for classification instead of softmax.

PhillHenry @PhillHenry 03:02
Are there any circumstances where it's OK? (BTW I'm trying to run a VAE).

Samuel Audet @saudet 03:34
IIRC, this happens with VAE when it's not able to fit the distribution well

Alex Black @AlexDBlack 04:14
VAE is an exception, to a degree. You can still have a mismatch between the reconstruction distribution and the actual data, and that will give you rubbish negative results (like incorrectly trying to fit gaussian data with bernoulli reconstruction distribution). However, it is possible (but rare) for negative log likelihood to be negative on very good fit of the data - IIRC we exclude some normalization terms. If that's the case, you should see excellent reconstructions though.
Well, I've already switched to Bernoulli so what's going wrong?

The Data (yet again)


Scratching my head over a coffee, it struck me: I was still standardising the data as if it were Gaussian! I'm not sure how a unit test could have saved me here, but OK, let's remove the standardisation - why would you standardise a Bernoulli distribution?
VAE score with LEAKYRELU
Wow, that looks almost normal. 

The Activation Function

Since we've changed an awful lot, now might be a good time to ditch TANH and try LEAKYRELU again. Sure enough, we don't see any NaNs and the score looks good.

Let's see what the ROC curve looks like with my test data:
ROC curve for VAE with LEAKYRELU, 1-hot encoding, Bernoulli reconstruction
OK, not amazing but it's starting to look better than a monkey score. But how do we make things better?

Architecture


Let's start with the architecture of the net itself.

"Larger networks will always work better than smaller networks, but their higher model capacity must be appropriately addressed with stronger regularization (such as higher weight decay), or they might overfit... The takeaway is that you should not be using smaller networks because you are afraid of overfitting. Instead, you should use as big of a neural network as your computational budget allows, and use other regularization techniques to control overfitting." (Andrej Karpathy)

"How many hidden layers are required to see an improvement where a shallow net underperforms is anyone's guess, but in general more depth would be better - you get more abstract, more general solutions. In practice though, optimization isn't quite so neat, and adding capacity adds risks of the process falling into various pitfalls - local minima, overfitting... And of course then there's the added computational cost." [StackOverflow]

So, let's try an architecture with layers of size [3002, 1200, 480, 192, 76, 30]:

VAE with 5 hidden layers


Here (StackOverflow) is an example of where the latent space is just too big causing overfitting. "You could compress the output further... Autoencoders do exactly that, except they get to pick the features themselves...  So what you do in your model is that you're describing your input image using over sixty-five thousand features. And in a variational autoencoder, each feature is actually a sliding scale between two distinct versions of a feature, e.g. male/female for faces...  Can you think of just a hundred ways to describe the differences between two realistic pictures in a meaningful way? Possible, I suppose, but they'll get increasingly forced as you try to go on.  With so much room to spare, the optimizer can comfortably encode each distinct training image's features in a non-overlapping slice of the latent space rather than learning the features of the training data taken globally."

The takeaway point is not to make the latent space layer too big. We deliberately want to squeeze the data through a small gap. So, let's really squeeze this data with an architecture like [2964, 1185, 474, 189, 75, 2]

VAE with 5 hidden layers and the last layer has 2 units
Hmm, no change and the same happened with final units of 5, 10 and 20. So, this looks like a dead-end.

Also, those weights are going crazy...

Weights

"For NNs though, what you want out of your deep layers is non-linearity. That's where the magnitudes of weights start to matter.

"In most activation functions, very small weights get more or less ignored or are treated as 'evidence against activation' so to speak.

"On top of that some activations, like sigmoid and tanh, get saturated. A large enough weight will effectively fix the output of the neuron to either the maximum or the minimum value of the activation function, so the higher the relative weights, the less room is left for subtlety in deciding whether to pass on the activation or not based on the inputs." [StackOverflow]

Regularization

"A model with large weights is more complex than a model with smaller weights. It is a sign of a network that may be overly specialized to training data. In practice, we prefer to choose the simpler models to solve a problem (e.g. Occam’s razor). We prefer models with smaller weights... The L2 approach is perhaps the most used and is traditionally referred to as “weight decay” in the field of neural networks." (MachineLearningMastery)

However, massively increasing the L2 regularisation made the score worse, and no amount of tuning seemed to change that. I was hoping to clip excessive activations, but adding RenormalizeL2PerParamType didn't seem to make any difference either.

Well, since adding more layers can be responsible for overfitting, let's now take them away. With merely two hidden layers of size [1003, 401] the weights look more respectable.
VAE with 2 hidden layers
The Data (last time)

Finally, I went back to the domain names yet again and stripped them not just of the Top Level Domains but also any preceding text. So, for instance, if the domain name is javaagile.blogspot.com it is mapped to simply blogspot.

Gratifyingly, the score went from lingering at about 100 down to 40 and the ROC curve now looks decent:


Things learned

- Unit test your data munging code.

- Keep your logs. Print out your hyper-parameters so you capture this in the logs.

- Shuffle your data! I wasted a day testing the neural net on a subset of data but this subset was sorted! Therefore, only the top 10000 data points were being used. You might find the Linux command, shuf, useful for doing this.

- Write tests that test your results, making sure they're sensible

Goldilocks

Getting the best out of the neural net seems to be a combination of tuning:
  • The data that best represents your problem. This is perhaps the most important.
  • The hyper-parameters.
  • The architecture.
As an aside, take a look at TensorFlow playground. It's a great way to get an intuitive feel for what is going on. It's just a neural net that runs in your browser.

Tuesday, March 19, 2019

Spotting Dodgy Domain Names


These are various approaches employing machine learning to differentiate between good domain names and bad ones. By bad, I mean domains that are used to trick people into thinking they're clicking on a legitimate address (www.goog1e.com, for instance).

Data

The data is the top 1 million domain names as recorded by Cisco. You can find it here.

The data was stripped of the top level domains to remove elements that were not useful.

Then, what was left was converted to bigrams of characters and one-hot encoded, leading to vectors of length 1444 (that is, 38 x 38 possible ASCII characters). The code for this lives here.

This data set was then split down a 95/5 ratio of training to holdout.

We created test data from this holdout data when we deliberately corrupted it. This was done by changing either a single 'o' to become an '0' or a single 'l' to become a '1'. If there were no such characters to corrupt, the data point was discarded.

Kullback-Leibler Results

We actually use a variant of KL divergence in these results that handles zeros in the data - the Jensen Shannon metric.

In the following histograms, red indicates bad domains and green good ones.

The tables represent the entire holdout ("Good") and the entire test ("Bad") data sets with their Jensen-Shannon metric calculated against the training data.

Note that these metrics are calculated by summing the columns of the data sets. This leads to something of an unrepresentative description of the data since the original is one-hot encoded. Therefore, in any subset of 38 elements of a real vector, only one can be 1 and all the rest must be 0. That is, the elements in a vector for a given domain are not independent of each other.
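
For what it's worth, SciPy ships an implementation: jensenshannon returns the square root of the JS divergence and copes with zero entries (the vectors below are made up):

import numpy as np
from scipy.spatial.distance import jensenshannon

baseline = np.array([0.5, 0.3, 0.2, 0.0])       # e.g. column sums of the training data
candidate = np.array([0.4, 0.3, 0.2, 0.1])      # column sums of a holdout or test set

print(jensenshannon(baseline, candidate))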

No normalisation

KL Score Histogram with no normalisation
Note that "+4.914e6" in the bottom right hand corner. Indeed the KL scores are close:

Class    KL Score
Good     4 319 810.24
Bad      4 169 380.40

There's a hair's breadth between them, so it is probably going to be hard to differentiate the classes.

L1-Normalise everything

Here, the control group's KL was 6370.17 and the bad domains scored 6370.92 - very close. The histograms unsurprisingly look similar:

KL Score Histogram with all vectors L1-Normalised
Hmm, still not much to work with, so let's try combinations of the two. First:

Normalised Baselines, No Normalisation for Others

In this trial, the baseline is L1-normalised but the other vectors are not.

Class    KL Score
Good     97 285.45
Bad      139 889.62

The histogram for the holdout and bad domains now looks like:

KL Score Histogram with the baseline L1 normalised; all other vectors unnormalised
This is good. There are now two distinct distributions with separate peaks.

L2-normalisation gave very similar KL scores and a graph that looked like:

KL Score Histogram with the baseline L2-normalised; all other vectors unnormalised
Let's try:

Unnormalised Baseline, L1-Normalisation for Others

... and it looks like we're back to square one. The KL scores are amazingly close:

Class    KL Score
Good     4 914 404.73
Bad      4 914 407.38

so, not surprisingly, are the distributions of the holdout and test data:

KL Score Histogram with the baseline unnormalized; all other vectors L1-normalized
Again, note that: +4.9144e6 in the bottom right hand corner.

So, we seem to be going backwards.

Aside: Normalise then sum baseline, all others unnormalised

I tried a few other variations like normalising then summing, first with L1:

Class    KL Score
Good     773 262.74
Bad      704 254.41

KL Score Histogram with the baseline L2-normalised then summed; all other vectors unnormalised

then L2:

Class    KL Score
Good     94 506.53
Bad      83 559.17

KL Score Histogram with the baseline L1-normalised then summed; all other vectors unnormalised

But their summed distributions didn't give me a difference in KL scores as good as the "Normalised Baselines, No Normalisation for Others" results, so I stuck with those and I discuss just that distribution in the next section.

The ROC

Running our model against the test data, the ROC looks somewhat underwhelming:

ROC for "Normalised Baselines, No Normalisation for Others" KL

Or, in 3d where we can see the threshold value:


where a threshold value of about 10 is the closest the curve comes to the top, left hand corner.

It seems clear that our tool cannot determine with great confidence whether a domain name is suspect or not. But then, could a human? Which of these domain names would you say are bogus and which are genuine?

mxa-00133b02.gslb.pphosted.com
m7.mwlzwwr.biz
live800plus.jp
lf.wangsu.cmcdn.cdn.10086.cn
x10.mhtjwmxf.com
mailex1.palomar.edu
3gppnetwork.org
mimicromaxfinallb-1513904418.ap-south-1.elb.amazonaws.com
mkt4137.com
modt1thf4yr7dff-yes29yy7h9.stream
jj40.com

This is of course a trick question. They're all genuine URLs that Cisco have logged.

However, if our tool is used as part of a suite of metrics it might identify nefarious activity.

Conclusion

Our tool is definitely better than the monkey score but can we improve it? I have a neural net that looks promising but is computationally very expensive. The KL calculations (and variants of them) are very fast and cheap. I'll compare them to a neural net solution in another post.

Monday, December 24, 2018

Unbalanced


... or imbalanced (apparently, imbalanced is the noun and unbalanced is the verb. Live and learn).

Anyway, in a data scientist job interview I was asked how I would build a network intrusion system using machine learning. I replied that I could build them a system that was 99% accurate. Gasps from my interviewers. I said, I could even give 99.9% accuracy. No! I could give 99.99% accuracy with ease.

I then admitted I could do this by saying that all traffic was fine (since much less than 0.01% was malicious), bill them and then head home. The issue here is imbalanced data.

This is a great summary of what to do in this situation. To recap:

1. If you have plenty of data, downsample the majority class. If you have too little, duplicate the minority class.

2. Generate new data by "wiggling" the original (eg, the SMOTE algorithm).

3. "Decision trees often perform well on imbalanced datasets" [ibid].

4. Penalize mistakes in misclassifying the minority class.

For this last suggestion, you can use a meta algorithm and any model but in ANNs, you get it out of the box. In DL4J, I added something like this:

.lossFunction(new LossNegativeLogLikelihood(Nd4j.create(Array(0.01f, 1f))))

"Ideally, this drives the network away from the poor local optima of always predicting the most commonly occurring class(es)." [1]

Did it improve my model? Well, first we have to abandon accuracy as our only metric. We introduce two new chaps, precision and recall.

Precision is the ratio of true positives to all the samples my model declared were positive irrespective of being true or false ("what fraction of alerts were correct?"). Mnemonic: PreCision is the sum of the Positive Column. Precision = TP / (TP + FP).

Recall is the ratio of true positives to all the positives in the data ("what fraction of intrusions did we correctly detect?"). Mnemonic: REcall deals with all the REal positives. Recall = TP / (TP + FN).

(There is also the F1 score which "is the 2*precision*recall/(precision+recall). It is also called the F Score or the F Measure. Put another way, the F1 score conveys the balance between the precision and the recall" - Dr Jason Brownlee. This equation comes from the harmonic mean of precision and recall).

What counts as a desirable result is very use-case dependent. But with weights 0.005f, 1f, I got:

Accuracy:  0.9760666666666666
Precision: 0.3192534381139489
Recall:    0.9285714285714286
.
.

=========================Confusion Matrix=========================
     0     1
-------------
 28957   693 | 0 = 0
    25   325 | 1 = 1

Confusion matrix format: Actual (rowClass) predicted as (columnClass) N times
==================================================================

using faux data (6000 observations of a time series 50 data points long and a class imbalance of 100:1)
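
Reading those figures straight off the confusion matrix above:

tp, fn = 325, 25        # intrusions caught vs missed
fp, tn = 693, 28957     # false alerts vs correctly ignored traffic

print("accuracy :", (tp + tn) / (tp + tn + fp + fn))   # ~0.976
print("precision:", tp / (tp + fp))                    # ~0.319
print("recall   :", tp / (tp + fn))                    # ~0.929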

I got the job to build an intrusion detection system, and these results would make the customer very happy - although I can't stress enough that this is a neural net trained on totally fake data. But if we could detect 93% of intrusions with only one in three investigations yielding a result, the client would be overjoyed. Currently, the false positives are causing them to turn the squelch in their current system up too high and consequently they're not spotting real intrusions.

[1] Deep Learning: A Practitioner's Approach


Saturday, December 22, 2018

Faux data and DeepLearning4J


Like most people in corporate data science, I'm having a hard time getting my hands on real but commercially sensitive data within the organisation. To this end, I'm writing a library to generate generic data. Sure, ersatz data is rarely decent, but I figured that if my models can't find the signal deliberately planted in the data, they're unlikely to find a signal in the real data.

I'm using the DeepLearning4J framework to see if the data is sensible. DL4J looks like it could be a real game-changer in the near future as it integrates easily with Spark (I ran their example with no major problems but noted it took quite a long time) and can import Keras models.

The first thing I did was try to build an Artificial Neural Net to detect unusual behaviour. One of the challenges I have in the day job is to distinguish between a network intrusion and genuine behaviour. Machines being accessed at night are suspicious but not necessarily compromised (some insomniac wants to do some work, for example). So, could a LSTM neural net (specially designed for time series) distinguish between night time and day time activity?

I stole the code for the ANN from here but came across this bug in the DL4J library, wasting a lot of time. Unfortunately, there has been no release with this bug fix so I am forced to use SNAPSHOTs. But even then, I was unable to detect the signal in the noise and kept seeing "Warning: 1 class was never predicted by the model and was excluded from average precision". Since I only have two classes (hacked/not hacked) this wasn't a great sign. Since the DL4J guys say "at this point you are just looking at a plain old tuning problem", I spent a lot of fruitless time twiddling knobs.

The problem was more profound than this. It appears I was asking an LSTM the wrong question. That is, although the data was a time series, each point in time was entirely unrelated to all the others. Changing the data to be either (1) spread randomly across a year or (2) tightly clustered around a date-time gave the neural net a fighting chance of distinguishing the two. In the latter case, the data points bunch around a single date-time (imagine somebody fat-fingering their password a few times per year versus somebody trying to guess it in a small time window). In this case, each timestamp is highly dependent on the others.