Thursday, December 16, 2021

Big Data or Pokemon?

It's hard to keep track of developments in big data. So much so that there is a quiz to see if you can differentiate big data tools from Pokemon characters. It's surprisingly difficult.

So, to help an old fool like me avoid being dissed by the cool kids, here is a cheat sheet of what's currently awesome:

Trino
Once known as PrestoSQL, Trino is a SQL engine that can sit on top of heterogeneous data sources. PrestoSQL is offered by AWS under the name "Athena". The incubating Apache Kyuubi appears to be similar but tailored for Spark.

Amundsen
Amundsen is a "data discovery and metadata engine". Apache Atlas is a Hadoop-based metadata and governance application written in Java.

Apache Griffin
Griffin is a JVM-based tool for checking data quality. TensorFlow Data Validation and Great Expectations are similar tools for Python.

Apache Arrow
Arrow is a language-agnostic columnar processing framework. You might need it if you want to use User Defined Aggregate Functions in PySpark [StackOverflow]. It's written in a number of languages, predominantly C++ and Java, and can help leverage GPUs.

Azkaban
Azkaban is a Hadoop workflow management tool from LinkedIn. It's open source and written in Java.

Koalas
Koalas brings the Pandas API to PySpark. It depends on Arrow to do this, apparently.

DBT
DBT is an open source Python project that does the T in ELT. Transforms are in templated SQL, apparently.

Apache Pinot
Pinot is a distributed, columnar data store written in Java that ingests batched and streaming data. It's a little like Apache Druid. "The only sustainable difference between Druid and Pinot is that Pinot depends on Helix framework and going to continue to depend on ZooKeeper, while Druid could move away from the dependency on ZooKeeper. On the other hand, Druid installations are going to continue to depend on the presence of some SQL database." [Medium]

Debezium
From the docs: "Debezium is an open source [Java] project that provides a low latency data streaming platform for change data capture (CDC). You setup and configure Debezium to monitor your databases, and then your applications consume events for each row-level change made to the database."

Samza
Apache Samza allows real-time analysis of streams. It comes from the same people who gave us Kafka (LinkedIn) and is written in Java and some Scala.

Pulsar
Apache Pulsar is like Apache Kafka but adds messaging on top of its streaming offering. It's mostly written in Java.

AirByte
This appears to be a half Java, half Python collection of connectors to "sync data from applications, APIs & databases to warehouses, lakes & other destinations."

Hudi
A Java/Scala Apache project that stands for "Hadoop Upserts Deletes and Incrementals", which pretty much describes it. It can integrate with Spark.

DataHub
This is an open source "data catalog built to enable end-to-end data discovery, data observability, and data governance" [homepage] that came out of LinkedIn. It is written in Python and Java.

Industry adoption of these big data tools can be found in this report [LinkedIn].

Andreesen-Horowitz make their predictions here.


Friday, December 3, 2021

More Logistic Regression

Log-Likelihood
How good is our model? I'm told the best way to think about the likelihood in a logistic regression is as the probability of having the data given the parameters. Compare this to what logistic regression is actually doing: giving the most probable parameters given the data.

First, some terminology:

Exo and Endo
"If an independent variable is correlated with the error term, we can use the independent variable to predict the error term, which violates the notion that the error term represents unpredictable random error... This assumption is referred to as exogeneity. Conversely, when this type of correlation exists, which violates the assumption, there is endogeneity." - Regression Analysis, Jim Frost.

"An endogenous variable is a variable in a statistical model that's changed or determined by its relationship with other variables within the model. In other words, an endogenous variable is synonymous with a dependent variable, meaning it correlates with other factors within the system being studied. Therefore, its values may be determined by other variables. Endogenous variables are the opposite of exogenous variables, which are independent variables or outside forces. Exogenous variables can have an impact on endogenous factors, however." [Investopedia]

Spark
Unfortunately, Spark does not seem to have a calculation for log-likelihood out of the box. This forced me to code my own. I looked at the code in the Python library, StatsModels, and converted it to PySpark.
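As a rough sketch of the calculation (not the code I actually wrote), the binomial log-likelihood can be computed straight from a predictions DataFrame. Here I'm assuming a 0/1 label column and a plain double column p holding the fitted probability per row; both column names are illustrative.

from pyspark.sql import functions as F

def log_likelihood(predictions, label_col="label", p_col="p"):
    # LL = sum over rows of  y * log(p) + (1 - y) * log(1 - p)
    ll = (F.col(label_col) * F.log(F.col(p_col))
          + (1 - F.col(label_col)) * F.log(1 - F.col(p_col)))
    return predictions.agg(F.sum(ll).alias("log_likelihood")).first()[0]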

The difference between Spark's output and StatsModels' was very data dependent. I guess this is inevitable since Spark uses IRLS [Wikipedia] as the solver and StatsModels was using l-BFGS. I was forced to use l-BFGS in StatsModels as I was otherwise getting "singular matrix" errors [SO].

But then, a passing data scientist helpfully pointed out that I was looking at the wrong StatsModels class (Logit in discrete_model.py, not GLM in generalized_linear_model.py). Taking the GLM code, I got within 6% of StatsModels when both were run on the same data. Could I do better? Well, first I rolled back the use of l-BFGS to the default solver (which is IRLS for both libraries) and removed a regularization parameter in the Spark code that had crept in from a copy-and-paste - oops. Now, the difference between the two was an impressive 4.5075472233299e-06. Banzai!

We may choose to have no regularization if we're looking for inferential statistics (which I am). This might lead to overfitting but that's OK. "Overfitting is predominantly an issue when building predictive models in which the goal is application to data not used to build the model itself... As far as other applications that are not predictive, overfitting is more secondary" [SO]

The Data
The size of the data makes a difference. The more data, the lower the likelihood that it is explained by the parameters. This is simply because the state space is bigger and that means there are more potentially wrong states.

Also, adding a bias of a single feature can change the log-likelihood significantly. Making a previously unimportant feature biased to a certain outcome 2/3 of the time reduced the LL in my data by an order of magnitude. Not surprisingly, if the data is constrained to a manifold, then it's more likely your model will find it.

Sunday, November 21, 2021

Why set cannot be a functor

This is well known but keeps tripping me up when I think of an example so here it is.

Set demands that its elements implement Eq (or equals() or whatever is semantically equivalent). List has no such requirement. A bogus equals implementation will lead to Set violating the functor laws. 

Imagine an implementation that looked something like:

class OddEquals(val id: Int, val label: String) {
  override def equals(a: Any): Boolean =
    if (a.isInstanceOf[OddEquals]) {
      val other = a.asInstanceOf[OddEquals]
      id == other.id
    } else false

  override def hashCode(): Int = id
}

which would be a fairly typical piece of code for objects whose identity depends solely on their primary key.

Now, imagine two services. One finds an entity given any related string, we'll call it f. I'll fake it like this:

  type F = String => OddEquals
  val f: F = x => new OddEquals(x.length, x)

The other service simply extracts the entity's name:

  type G = OddEquals => String
  val g: G = _.label

The functor law says:

xs.map(x => g(f(x))) == xs.map(f).map(g)

But if xs is Set("hello", "world"), the left-hand side is the same as the input but the right-hand side is Set("hello").

Compare this to List("hello", "world") and you'll see that both left- and right-hand sides are equal (that is, the same as xs).

Wednesday, November 17, 2021

Python's Garbage

Python's GC appears to be hugely different to the GC in Java we know and love. I have some PySpark code that collects data from a data set back in the driver and then processes it in Pandas. The first few collects work nicely but ultimately PySpark barfs with memory issues and no amount of increasing the driver memory seems to solve the problem. This is odd, because each collect itself only retrieves a modest amount of data. After that, I'm not interested in it.

"Reducing memory usage in Python is difficult, because Python does not actually release memory back to the operating system. If you delete objects, then the memory is available to new Python objects, but not free()'d back to the system.

"As noted in the comments, there are some things to try: gc.collect (@EdChum) may clear stuff, for example. At least from my experience, these things sometimes work and often don't. There is one thing that always works, however, because it is done at the OS, not language, level.

Then if you do something like

import multiprocessing
result = multiprocessing.Pool(1).map(huge_intermediate_calc, [something_])[0]

Then the function is executed at a different process. When that process completes, the OS retakes all the resources it used. There's really nothing Python, pandas, the garbage collector, could do to stop that." [StackOverflow]

However, this does not work well for turning PySpark DataFrames to Pandas data frames. Spark is lazy so the DataFrame cannot be serialized and given to the new process (you cannot serialize, for instance, an open network port).

Ultimately, I had to cut my losses and return from StatsModels to Spark's logistic regression implementation.

In addition to this, we have the horrors of what happens with a Spark UDF when written in Python:

"Each row of the DataFrame is serialised, sent to the Python Runtime and returned to the JVM. As you can imagine, it is nothing optimal. There are some projects that try to optimise this problem. One of them is Apache Arrow, which is based on applying UDFs with Pandas." [Damavis]

Yuck. 
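For contrast, here is a minimal sketch of the Arrow-backed route the quote mentions: a pandas UDF exchanges whole Arrow record batches with the Python worker instead of pickling row by row. The DataFrame and the doubling logic are purely illustrative.

import pandas as pd
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import LongType

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000).withColumnRenamed("id", "x")

# Plain Python UDF: each row is serialised, sent to the Python worker and back.
double_plain = F.udf(lambda x: x * 2, LongType())

# pandas UDF: whole Arrow batches are exchanged, so far fewer round trips.
@F.pandas_udf(LongType())
def double_arrow(x: pd.Series) -> pd.Series:
    return x * 2

df.select(double_plain("x"), double_arrow("x")).show(5)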

Wednesday, November 10, 2021

Logistic Regression in Action

Once again, I'm looking at patient data waiting times. This time, it's the A&E department. I'm not so much interested in how long they wait but rather the binary condition of whether they waited more or less than four hours.

Logistic Regression with Spark 

I used Spark's GeneralizedLinearRegression. Hyperparameter tuning did not produce much difference in output.

Splitting the data 90/10 for train and test gave me the following output on the test curves:

Prediction where true value is 1

The above shows the predictions for patients that we know really did wait more than 4 hours for treatment. The below shows predictions where we really know they waited less:

Prediction where true value is 0

Clearly, the model has spotted something but the overlap between the two states is large. What should the threshold be to best differentiate one from the other? Fortunately, I'm not trying to predict but to infer.

You can evaluate how good the model is with some simple PySpark code:

from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator(labelCol=dependent_col,
                                          rawPredictionCol=prediction_col)
auroc = evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderROC"})
auprc = evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderPR"})

Evaluating with Spark's BinaryClassificationEvaluator gave me an area under the ROC curve of 0.79. Hardly stellar but not terrible.

"A no-skill classifier will be a horizontal line on the plot with a precision that is proportional to the number of positive examples in the dataset. For a balanced dataset this will be 0.5... Precision-recall curves (PR curves) are recommended for highly skewed domains where ROC curves may provide an excessively optimistic view of the performance." [MachineLearningMastery]

Probabilities, Odds, and Odds-Ratios

"Odds are the probability of an event occurring divided by the probability of the event not occurring. An odds ratio is the odds of the event in one group ... divided by the odds in another group." [Making sense of odds and odds ratios]

"The [linear regression] model is specified in terms of K−1 log-odds or logit transformations (reflecting the constraint that the probabilities sum to one)... The choice of denominator is arbitrary in that the estimates are equivariant under this choice." [Elements of Statistical Learning]

K here is the number of groups. So, for the case where we're looking at true/false outputs, K=2 and the equations in ESL become the simple equation in the following section.

Model output

An example: "given estimated parameters and values for x1 and x2 , the model might predict y = 0.5, but the only meaningful values of y are 0 and 1.  It is tempting to interpret a result like that as a probability; for example, we might say that a respondent with particular values of x1 and x2 has a 50% chance. But it is also possible for this model to predict y = 1.1 or y = −0.1, and those are not valid probabilities. Logistic regression avoids this problem by expressing predictions in terms of odds rather than probabilities." [Think Stats, 2nd Edition, Downey]

In a two-state output, Downey's equation would be:

y = log(odds) = log(p / (1 - p)) = β0 + β1x1 + ...

Odds(-ratios) are used when reporting effect size. "when X, the predictor is continuous, the odds ratio is constant across values of X.  But probabilities aren’t" because of that non-linear relationship above between odds and p.

"It works exactly the same way as interest rates.  I can tell you that an annual interest rate is 8%.  So at the end of the year, you’ll earn $8 if you invested $100, or $40 if you invested $500.  The rate stays constant, but the actual amount earned differs based on the amount invested. Odds ratios work the same." [TheAnalysisFactor] Since there is no bound on x1 etc, you could easily have a value greater than 1.0 or less than 0.0 which just doesn't make sense for probabilities.

Interpreting the coefficients

I have 36 features and the p-values of approximately 30 of them are all less than 0.05. This seems suspiciously good even if sometimes their corresponding coefficients are not particularly significant. "If that difference between your coefficient and zero isn't interesting to you, don't feel obligated to interpret it substantively just because p < .05." [StackOverflow]

"How does one interpret a coefficient of 0.081 (Std. Error = 0.026)? ... Incorporating the standard error we get an approximate 95% confidence interval of exp(0.081 ± 2 × 0.026) = (1.03, 1.14).

"This summary includes Z scores for each of the coefficients in the model (coefficients divided by their standard errors)... A Z score greater than approximately 2 in absolute value is significant at the 5% level." [Elements of Statistical Learning]

Interpreting the intercept

If we take the example here [QuantifyingHealth] where we're looking at the relationship between tobacco consumption (x1) and heart disease (y) over a 10 year period, then the interpretation of the intercept's coefficient depends on several conditions. 

For example, say that the intercept's coefficient is -1.93. This leads us to believe the probability of heart disease for a non-smoker (x1=0) is 0.13 by plugging -1.93 in the above equation then rearranging.
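As a quick sanity check of that 0.13, rearrange log(p / (1 - p)) = -1.93:

import math

p = 1 / (1 + math.exp(1.93))   # from log(p / (1 - p)) = -1.93
print(round(p, 2))             # 0.13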

If the smoking consumption is standardized, x1=0 "corresponds to the mean lifetime usage of tobacco in kg, and the interpretation becomes: For an average consumer of tobacco, the probability of having a heart disease in the next 10 years is 0.13" [ibid].

If we're only interested in smokers not the general population, then x1≠0. "Let’s pick the maximum [623kg] as a reference and calculate the limit of how much smoking can affect the risk of heart disease... We get P = 0.22. The interpretation becomes: the maximum lifetime consumption of 623 kg of tobacco is associated with a 22% risk of having heart disease".

Thursday, October 28, 2021

Logistic Regression Notes

Just a few, miscellaneous notes on logistic regression models that I'm using Spark to build for me.

Interpreting Coefficients

The coefficients for one-hot encoded features can simply be compared to show the strength of the effect. "Things are marginally more complicated for the numeric predictor variables. A coefficient for a predictor variable shows the effect of a one unit change in the predictor variable." [DisplayR]

"How does one interpret a coefficient of 0.081 (Std. Error = 0.026) for tobacco, for example? Tobacco is measured in total lifetime usage in kilograms, with a median of 1.0kg for the controls and 4.1kg for the cases. Thus an increase of 1kg in lifetime tobacco usage accounts for an increase in the odds of coronary heart disease of exp(0.081) = 1.084 or 8.4%. Incorporating the standard error we get an approximate 95% confidence interval of exp(0.081 ± 2 × 0.026) = (1.03, 1.14)." [1]

Computational Problems

How do we avoid problems with singular matrices? "it's computationally cheaper (faster) to find the solution using the gradient descent in some cases... it works even when the design matrix has collinearity issues." [StackExchange]

"Switching to an optimizer that does not use the Hessian often succeeds in those cases. For example, scipy's 'bfgs' is a good optimizer that works in many cases"[StackOverflow]

Generalized Linear Models

"Where the response variable has an exponential family distribution [Gaussian, Bernoulli, gamma etc], whose mean is a linear function of the inputs ... this is known as a generalized linear model, and generalizes the idea of logistic regression to other kinds of response variables." [Machine Learning-A Probabilistic Perspective - Murphy]

"Software implementations can take advantage of these connections. For example, the generalized linear modeling software in R (which includes logistic regression as part of the binomial family of models) exploits them fully. GLM (generalized linear model) objects can be treated as linear model objects, and all the tools available for linear models can be applied automatically." [1]

Z-Scores

Z-Scores are "coefficients divided by their standard errors. A nonsignificant Z score suggests a coefficient can be dropped from the model.  Each of these correspond formally to a test of the null hypothesis that the coefficient in question is zero, while all the others are not (also known as the Wald test). A Z score greater than approximately 2 in absolute value is significant at the 5% level." [1]

Although this sounds like the job of a p-value, note that p-values are different from (but related to) z-scores. A "p-value indicates how unlikely the statistic is. z-score indicates how far away from the mean it is. There may be a difference between them, depending on the sample size." [StackOverflow]

You can see quite clearly that jiggling sample size can profoundly change the numbers of the metrics. It's very useful to be able to quickly create fake but realistic data and play around and watch this effect. Given a large amount of data, you might want to reduce the p-value cut off since "in large samples is more appropriate to choose a size of 1% or less rather than the 'traditional' 5%." [SO]
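A hedged sketch of that "jiggle the sample size" experiment, using entirely synthetic data: the same weak effect tends to look "significant" purely because more rows were added.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)

def p_value_for(n):
    x = rng.normal(size=n)
    p = 1 / (1 + np.exp(-0.05 * x))     # a genuinely small effect: log-odds = 0.05 * x
    y = rng.binomial(1, p)
    model = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
    return model.pvalues[1]             # p-value of the slope

for n in (1_000, 100_000):
    print(n, p_value_for(n))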

[1] Elements of Statistical Learning

Saturday, October 23, 2021

Miscellaneous Spark

Feature names in Spark ML

Why isn't this better known? This is a great way to extract the feature names from a Dataset after the features have been turned into vectors (including even one-hot encoding):

    index_to_name = {}
    attrs = df.schema[OUTPUT_COL].metadata["ml_attr"]["attrs"]
    for attr_type in attrs:                  # e.g. "numeric", "binary"
        for m in attrs[attr_type]:
            index_to_name[m["idx"]] = m["name"]

where OUTPUT_COL is the column containing a Vector of encoded features.

Non-deterministic Lists

I've seen this Spark-related bug twice in our code base: even though lists have an order, collect_list "is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle." [StackOverflow]

The solution in the SO answer is to run the collect over a Window that performs an orderBy.
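Something like the following is the shape of that fix, assuming a DataFrame df with hypothetical columns group_id, ts and value:

from pyspark.sql import functions as F, Window

w = (Window.partitionBy("group_id")
           .orderBy("ts")
           .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

ordered = (df.withColumn("values_in_order", F.collect_list("value").over(w))
             .select("group_id", "values_in_order")
             .dropDuplicates(["group_id"]))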

Wednesday, June 30, 2021

Journeys in Data Engineering

I'm currently helping a huge health provider with its data issues. Here are some things I've learned:

Don't bikeshed. There's no point wondering whether you've captured the correct ethnic description ("White? Irish White? Eastern European White?") when there are bigger fish to fry. For instance, hospitals were putting the patients' personal information in the primary key field. This sent the data officer apoplectic until it was cleansed from the files before being sent downstream. But as a result, the downstream keys were not unique. The data scientists consuming this data aggregated it and came to conclusions unwittingly based on one individual having visited the hospital over three million times.

Don't proactively go looking for data quality issues. They'll come looking for you. Continue to build your models but be extremely circumspect. This is a more efficient process than spending time wondering if the data looks good.

Just because the data looks good, it doesn't mean it's usable. How often is the data updated? Is it immutable or is it frequently adjusted? In short, there's an orthogonal axis in data space separate to data quality and it's a function of time. Perfect data that becomes old is (probably) no longer perfect. Perfect data that becomes incomplete is (probably) no longer perfect.

Give a thought to integration tests. Platforms like Palantir are pretty under-developed in this area (answer to my requests for an integration test platform: "we're working on it"). So, you may need to write bespoke code that just kicks the tires of your data every time you do a refresh. 

Remember that documentation is the code. I've had a nice experience using ScalaTest with its Given, When, Thens. It ensured that when running integration tests, the code generated the documentation so the two would never fall out of synch. This is, of course, much harder (impossible?) when running in a walled garden like Palantir.

Stick with a strongly typed language. This can be hard since pretty much all data scientists use Python (and therefore PySpark) but there is nothing worse than waiting several minutes for your code to run only to find out that a typo trips you up. In a language that is at least compiled, such problems would not occur. I'd go so far as to say that Python is simply not the correct tool for distributed computing.

Python tools like Airflow are made much easier by using Docker since Python library version management is a pain in the arse. Far better for each application to have its own Docker container.

Never forget that your schema is your data contract as much as an API is your coding contract. Change it, and people downstream may squeal.

Persisting intermediate data sets hugely helps debugging.

Finally, don't use null to indicate something for which you know the meaning. If, for instance, a record is in a certain state, it's better to set value X to a flag that indicates that state. If you use null, somebody reading the data doesn't know whether the record is in this state or the data is just missing.

Friday, June 25, 2021

Analysing a Regression Analysis Model

I'm playing around with hospital waiting lists trying to find out what factors affect waiting times. 

Using the Pearson Correlation of my features (which are the ethnic make up of a hospital waiting list and the time spent waiting), the data looks like this:

Pearson Correlation Heatmap: Ethnicity and waiting time

What if I normalise the rows?
Pearson Correlation Heatmap: normalized rows

Well that was silly. Of course there will be a negative correlation between ethnicities as the total needs to sum to 1.0.

Anyway, it was at this point I found the data was dirty beyond salvage due to upstream cleaning processes gone wrong. Having got a new data set, I tried again (this time ignoring the Pearson correlation between ethnicities):

Pearson correlation: ethnicities vs waiting list time

Note that this time, the data was standardized. It looks like waiting list time goes up for white people.

Inferential Linear Regression 

Putting Pearson correlation to one side, let's see a linear regression model trained on this data. Note, in Spark's linear regression algorithm, one is able to use ElasticNet which allows a mix between L1 (lasso) and L2 (ridge regression) regularization.
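For reference, the mixing looks something like this in PySpark (a sketch, not the exact model from this post; train_df and the features/label column names are assumptions):

from pyspark.ml.regression import LinearRegression

lr = LinearRegression(featuresCol="features", labelCol="label",
                      regParam=0.1,          # overall regularization strength
                      elasticNetParam=0.5)   # 0.0 = pure L2 (ridge), 1.0 = pure L1 (lasso)
model = lr.fit(train_df)
print(model.coefficients, model.intercept)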

Using dummy encoding, the results look like this:

coefficient           category   p-value
 2.9759572500367764   White      0.011753367452440378
 0.607824511239789    Black      0.6505335882518613
 1.521480345096513    Other      0.2828384545133975
 1.2063242152220122   Mixed      0.4440061099633672
14.909193776961727    intercept  0.0

Hmm, those p-values look far from conclusive. OK, let's make the most populous group the zero-vector:

coefficient           category   p-value
-3.133385162559836    Mixed      0.01070291868710882
-1.7512494284036988   Black      0.10414898264964889
-1.4364038661386487   Other      0.08250720507990783
-2.3504004542073984   Asian      0.0006682319884454557
17.88573117661592     intercept  0.0

Now this is more interesting. The p-values are still a little too high to draw conclusions about two groups but it's starting to look like the waiting list size is lower if you are Asian or Mixed ethnicity.

Adding other columns makes the ethnicities look even more certain, although the p-values for these new categories - including socioeconomic status and age - were themselves not particularly conclusive.

coefficient           category   p-value
-4.309164982942572    Mixed      0.0004849926653480718
-2.3795206765105696   Black      0.027868866311572482
-2.2066035364090726   Other      0.008510096871574113
-2.9196528536463315   Asian      2.83433623644580e-05
20.342211763968347    intercept  0.0

Now those p-values look pretty good and they're in agreement with my Pearson correlation. 

"A beautiful aspect of regression analysis is that you hold the other independent variables constant by merely including them in your model! [Alternatively,] omitting an important variable causes it to be uncontrolled, and it can bias the results for the variables that you do include in the model."[Regression Analysis, Jim Frost]

Work continues.

Thursday, June 10, 2021

Notes on Regression Analysis


Regression Analysis 

"A key goal of regression analysis is to isolate the relationship between each independent variable and the dependent variable. The interpretation of a regression coefficient is that it represents the mean change in the dependent variable for each unit change in an independent variable when you hold all of the other independent variables constant." [Multicollinearity in Regression Analysis]

This last point is key. Normalizing rows means that there will be anticorrelation between fields. This is because if one value increases the others must necessarily decrease as they all sum to 1.0.

Similarly, one-hot encoding by definition increases multicollinearity because if feature X has value 1, then I know that all the others have 0, "which can be problematic when your sample size is small" [1].

The linked article then describes how Variance Inflation Factors ("VIFs") can be used to identify multicollinearity.
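A minimal sketch of computing VIFs with statsmodels, assuming features_df is a pandas DataFrame of the model's numeric features (the name and threshold are mine, not from the article):

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

X = add_constant(features_df)             # features_df is a hypothetical DataFrame
vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vifs)                               # values much above ~5-10 usually flag multicollinearity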

"If you just want to make predictions, the model with severe multicollinearity is just as good!" [MiRA]

To remove or not?

The case for adding columns: Frost [1] has a good example on how excluding correlated features can give the wrong result. It describes how a study to test the health impact of coffee at first showed it was bad for you. However, this study ignored smokers. Since smokers are statistically more likely to be coffee drinkers, you can't exclude smoking from your study.

The case for removing columns: "P-values less than the significance level indicate that the term is statistically significant. When a variable is not significant, consider removing it from the model." [1]

Inferential Statistics

But what if we don't want to make a model? "Regression analysis is a form of inferential statistics [in which] p-values and coefficients are the key regression output." [1]

Note that the p-value indicates whether we should reject the null hypothesis. It is not an estimate of how accurate our coefficient is. Even if the coefficient is large, if the p-value is also large "the observed difference ... might represent random error. If we were to collect another random sample and perform the analysis again, this [coefficient] might vanish." [1]

Which brings us back to multicollinearity as it "reduces the precision of the estimated coefficients, which weakens the statistical power of your regression model... [it] affects the coefficients and p-values." 

["Statistical power in a hypothesis test is the probability that the test can detect an effect that truly exists." - Jim Frost]

[1] Regression Analysis, Jim Frost

Sunday, May 9, 2021

Higher Dimensional Distances

I was happily finding the Euclidean distance between vectors in a higher dimension feature space when my boss told me that Euclidean distance becomes less informative at greater dimensions - see the paper On the Surprising Behavior of Distance Metrics in High Dimensional Space.

Some more Googling shows this is a pretty well documented phenomenon:

"Our intuitions, which come from a three-dimensional world, often do not apply in high-dimensional ones. In high dimensions, most of the mass of a multivariate Gaussian distribution is not near the mean, but in an increasingly distant “shell” around it; and most of the volume of a high-dimensional orange is in the skin, not the pulp. If a constant number of examples is distributed uniformly in a high-dimensional hypercube, beyond some dimensionality most examples are closer to a face of the hypercube than to their nearest neighbor. And if we approximate a hypersphere by inscribing it in a hypercube, in high dimensions almost all the volume of the hypercube is outside the hypersphere.  This is bad news for machine learning, where shapes of one type are often approximated by shapes of another." A Few Useful Things to Know About Machine Learning

We can show this with some code here. Given uniformly distributed points in a d-dimensional hypercube, the ratio of the difference between the largest and smallest separation to the overall largest gap behaves like this:

distance ratio over number of dimensions of uniformly distributed points

Alternatively stated, the ratio of the minimum distance to the maximum distance in a cluster of 100 uniformly distributed points approaches 1 as the number of dimensions approaches infinity:
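A quick sketch of that effect (not the code from the linked repository): as d grows, the nearest and farthest pairwise distances become relatively indistinguishable.

import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    points = rng.uniform(size=(100, d))
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)[np.triu_indices(100, k=1)]
    print(d, dists.min() / dists.max())    # the ratio creeps towards 1 with d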


Doomed?

But "signal-to-noise matters" [SO]. Note that the above examples used pure randomness in the data.

Instead of using random vectors, if we deliberately contrived two clusters and looked at the ratio of average distances within a cluster to the average distance between two clusters, the results look more like this:

Ratio of distances within vs distances between two clusters

where ratios quickly settle down. "Mapping your data set x,y→x,y,x,y,x,y,x,y,...,x,y increases representative dimensionality, but does not at all make Euclidean distance fail. (See also: intrinsic dimensionality) So in the end, it still depends on your data. If you have a lot of useless attributes, Euclidean distance will become useless. If you could easily embed your data in a low-dimensional data space, then Euclidean distance should also work in the full dimensional space." [ibid]

So, although my boss is technically correct, you may not have to worry about it. My algorithm went on to produce acceptable results.

Friday, March 26, 2021

Unsupervised nearest neighbours

Interesting problem: we have 33 thousand 'areas' in the UK for the purposes of health. People would like to compare one health area with another that has similar demographics etc. We'd like to find these peers automatically using machine learning but nobody really knows what makes a neighbour in feature space an actual neighbour. 

This is an unsupervised ML task. We cannot use the go-to classification algorithm KNN (where a class is elected by the k nearest neighbours as calculated with some arbitrary distance metric) as this is a supervised learning algorithm and each vector is its own distinct class. That is, there is no class with more than one data point in it.

So, how do we measure how good our algorithm is? We might like to use hypothesis testing where we compare the 10 nearest neighbours (k=10 for us) from our algorithm against 10 randomly chosen feature vectors (null hypothesis). For both these samples, we follow this process: take, say, 200 other vectors randomly chosen (synthetic data) and, for each feature of each of the 200, calculate whether it's within the min/max range of the sample (test statistic). With this we can calculate whether we're getting results better than chance (p-value). So, we have fulfilled all of Downey's criteria for a hypothesis test in his "There is still only one test" blog post.

Of course, we would jolly well hope that our results are better than chance but this introduces another problem: when every combination of feature engineering scores 100%, which is the best? We could take the difference between the test statistic for the null hypothesis sample and the actual sample. This is what we did in the end but it is one of any number of metrics.

Another metric we tested was the Pearson correlation statistic in measuring each feature vector in a sample with all the others. If the average score for our actual sample is greater than the null hypothesis then we know we're on the right path. Unfortunately, this proved to be not great. Thinking about it, it's not too surprising since the vectors are not random numbers but the demographics within (roughly) equal sized regions in the same country.

The Algorithm

Having established a test statistic, we then need to choose an algorithm and features.

Using the Freedman-Diaconis algorithm to generate optimal bucket sizes from which probabilities could be calculated, I then used the Jensen-Shannon divergence to calculate nearest neighbours. However, our data was Zipfian so this meant almost everything fell into the same bucket. With some hacking, I resized the buckets so that there were many around the sharp peak and few at the edges. However, the score from using JS was still not great. I'm told this is because JS is better for finding outliers, which also jibes with my experience.
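For what it's worth, the bucket-then-compare idea looks roughly like this. The data here is synthetic (skewed lognormal standing in for our Zipfian data) and the column pairing is invented.

import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(1)
a = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)
b = rng.lognormal(mean=0.2, sigma=1.0, size=10_000)

# Freedman-Diaconis bin edges over the pooled data so both histograms align.
edges = np.histogram_bin_edges(np.concatenate([a, b]), bins="fd")
p, _ = np.histogram(a, bins=edges, density=True)
q, _ = np.histogram(b, bins=edges, density=True)

print(jensenshannon(p, q))   # 0 means identical distributions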

Instead, the algorithm we used was plain L2 Euclidean distance. Bearing in mind that Euclidean distance is not a good metric in high dimensions [SO], our test statistic looked good when we applied it to bucketed (!) and scaled data. What's more, it was computationally cheap (remember that comparing all nodes is an O(N²) problem) and easy to explain to the client.

Palantir

For this work, I was asked to use Palantir technology, which was pretty easy since it seems to be a thin wrapper around Spark. Qualms aside about selling open source technology at high prices (with some features actually crippled), here are some best practices I came to acquire:

  • Don't have multiple tabs open on the codebase. Apparently, you can download the code into your favourite IDE but you won't have the data. So, at some point, you'll be coding in the browser. If you have multiple tabs open, Palantir gets confused and overwrites versions. Not a great user experience.
  • Create a pyspark file for each combination in the pipeline that is only a couple of lines long which delegates to shared code. This way, it's easier to explore lineage with the Monocle tool. This is actually quite a nice feature of Palantir giving you a graphical representation of your pipelines.


Monday, March 1, 2021

Spark Code Submissions

Hurray! My Py/Spark pull request to optimize hyperparameters using randomization has been accepted and merged to branch 3.2!

Here are some things I have learned about maintaining the Spark codebase.

  1. Scalacheck is used extensively. This is a great little library that I should really use more in my work code. It checks corner cases such as: if your function needs an Int then how does it handle MaxValue? How about MinValue?
  2. Unfortunately (in my opinion) the documentation is generated with Jekyll which runs on Ruby, requiring me to install all sorts of weird (to me) software on my system. This StackOverflow post helped me avoid doing all sorts of nasty things to my laptop as root. Documentation is generated with bundle exec jekyll build which takes the .md files and enhances them, so tools like IntelliJ that give you a markdown viewer are not terribly useful. You have to spend minutes building the site each time you change it, it seems. Compare that to mdoc which, via an SBT plugin, allows you to do things like have the documentation rebuilt the moment you change a file with docs/mdoc --watch.
  3. The style checks are draconian with even the spaces between imports being significant. I guess this makes a lot of sense in that code reviews are now more automated (I was surprised that it was only really the overworked Sean Owen who was reviewing my code) but it was annoying to have builds fail because of very minor infractions. The $SPARK_HOME/dev directory however has some scripts that let you run this kind of thing on your own laptop first.
  4. All code must be accessible to Java, Scala and Python. Scala/Java compatibility is not without its wrinkles but PySpark code is a parallel code structure that mostly wraps Spark. Where it doesn't (for example, code on the client side), the Python must have as similar an interface to the JVM classes as possible.

All in all, quite difficult but I think this is a fixed cost. My next pull request should be much faster.

Exception, Exception, Exceptions

I asked this question in the Discord channel about ZIO 1.0.3:

If I have a ZIO[Any, Throwable, String] that just throws an Exception then I can handle the exception with ZIO.either

But if I have a ZIO[Any, Throwable, String] that throws an Exception in the release of bracket I get:

    Fiber failed.
    An unchecked error was produced.


Is that expected? [Code to demonstrate here on GitHub]

Adam Fraser responded:

@PhillHenry The release action of bracket should never fail so it has a type of URIO[R, Any]. If there is an error there you need to expose the full cause and either handle it or ignore it. 

You are kind of lying to the compiler by doing URIO(x.close()).

URIO promises that something will never fail but close can fail so that is going to create an unchecked exception. 

If you make that something like Task(x.close()) to reflect the fact that the effect can fail then you are going to get a compilation error when you try to put that in bracket. 

And then you have a choice as the user. If a finalizer fails, which really shouldn't happen, how do you want to treat it? One option is to say just fail at that point. Not successfully running the finalizer indicates a potential resource leak and better to fail fast than die slowly later. So then you would call orDie on the effect, and you could potentially handle the cause at a higher level of your application. Another option is to say we made our best efforts, we just need to move on, and then you would do ignore or maybe log the error somewhere and then ignore it.

Why couldn't the Exception manifest itself in the error channel of the ZIO?

Because the error channel models the failure of the acquire and use actions of bracket. It is possible that the use action fails and then the release action runs and also fails, so we need different channels for those errors.

Could ZIO not use Throwable.addSuppressed? I notice that this is what Java's try-with-resource does in these situations

Well we have to be polymorphic. The error of use may not be a Throwable at all so we definitely can't add a suppressed exception to that.

That also makes it really easy to lose failures in finalizers when normally a failure in a finalizer is very bad and something you should explicitly handle.

Well the only way you are cheating the compiler is by doing UIO(somethingThatThrows). I'm not sure how that can be prevented other than by preventing users from constructing UIO values at all, which prevents users from describing effects that really can't fail. 

I think the lesson is just to only use UIO for effects that really don't throw exceptions (or if they do throw exceptions you are comfortable treating those as defects).

Can we not force bracket to be available only if the error channel is a Throwable?
But we want to use bracket in a ton of situations where it is not.
The bracket operator is core to safe resource management. Saying we could only use it when E was Throwable would prevent us from writing resource safe code with a polymorphic error type, which is basically all code we want to write.

Disquieting

If a ZIO can throw any Exception to bring the whole machinery to a crashing halt, why have an error channel in the first place? This seems like ZIO is a leaky abstraction. And this is what bothers me, although colleagues have told me they don't find it a worry at all.

The Cats way

Drew Boardman @drewboardman Feb 23 22:15
Basically I'm trying to see if there exists something like MonadError that signals whether the error has been handled but I think this is only possible with datatypes, not with typeclasses. I was having a discussion about how MonadError doesn't really signal, at the type-level, that anything has been handled. I basically got around to just creating a datatype that signals this - but that effectively just re-invents Either
Adam Rosien @arosien Feb 23 22:18
"whether the error has been handled" - do you mean knowing this statically? MonadError and the like don't distinguish, in the type, error-handling.
Fabio Labella @SystemFw Feb 23 22:18
It's kinda possible with cats-mtl , but in general yeah it's a lot easier with datatypes... So in this case you'd have a MonadError constraint, and handle it (eliminate it) by instantiating to EitherT (locally) and then you handle that. 
To recap, MonadError tells you that the called code has the right to raiseError and if it does, what type this will be. But there is no guarantee that an IO will not throw an Exception that brings the JVM crashing down. 

IOs that throw Exceptions and MonadErrors that raiseError can be .attempted to get an IO[Either[Throwable, A]]. That is, the datatype, not the effect, indicates whether the exception has been handled or not.

Thursday, February 18, 2021

PySpark for JVM Developers

PySpark uses Py4J to talk to Spark. "Py4J enables Python programs running in a Python interpreter to dynamically access Java objects in a Java Virtual Machine" [Py4J]. The Python class structure mirrors the Scala structure closely but is otherwise a totally separate codebase. Note that the functionality is not necessarily 1-to-1. Parts of Spark are only available to the JVM. For instance, Dataset.groupByKey is not available to Python in Spark 2.x, although it is in Spark 3.

PySpark within notebooks

Python integrates with Spark quite nicely in Jupyter and Zeppelin. You just need some code that looks like:

%pyspark

from pandas import DataFrame

dfSpark = sqlContext.sql("select * from MY_TABLE")
df = dfSpark.toPandas()
...

and you can be manipulating your data in Pandas. But this is hard to test.

If you want to run some Scala on data that was processed with Python, you can always run something like this in the pyspark cell:

pythonDF.registerTempTable("TableName")

and this in a %scala cell:

val df = sqlContext.sql("select * from TableName")
 
thus passing the data between the two languages.

PySpark outside notebooks

Automated testing of Spark via Python is available via PyTest (see this SO answer for some sample code). It fires up a Spark JVM instance locally in the background and stops it when the tests are done. The JAR for this Spark instance will live in your site-packages/pyspark/jars/ directory and the JVM in your PATH environment variable will be used.
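A minimal sketch of the idea, assuming a conftest.py-style fixture (the fixture and test names are mine):

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # One local Spark instance shared by the whole test session.
    session = (SparkSession.builder
               .master("local[2]")
               .appName("pyspark-tests")
               .getOrCreate())
    yield session
    session.stop()

def test_word_count(spark):
    df = spark.createDataFrame([("a",), ("b",), ("a",)], ["word"])
    assert df.groupBy("word").count().count() == 2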

Note that it's much easier to do all of this in the JVM using the BigDataTesting library which allows Hadoop, Spark, Hive and Kafka to be tested in a single JVM [disclaimer: I'm the author]. But since Python is being used more and more in data engineering, it appears possible [SO] to do this via Py4J as well.

Building and deploying

To create a deployable artifact, you can package your code in a .whl ("wheel") file which can be easily deployed to something like DataBricks. I have a simple PySpark project in my GitHub repository that demonstrates how to test and package Spark code in Python.


Friday, January 29, 2021

Spark Numerical Optimization

Neural Net Gradient Optimization

Spark's implementation of artificial neural nets is a little sparse. For instance, the only activation function is the private SigmoidFunction. It seems like there are other ways of training large ANNs at scale. This is an interesting article where data scientists used Deep Java Library (DJL - not to be confused with DL4J) in a Spark framework to orchestrate PyTorch model training.

In theory, one could add one's own activation functions and extract the weights that Spark generated. I took a look at the code of MultiLayerPerceptronClassifier and saw how, in each epoch, it broadcasts the weights to all the executors and then uses RDD.treeAggregate to compute the gradient per vector, per partition and then combines them all. In the case of the BFGS numerical optimization technique, the Breeze library is used and the call to treeAggregate is invoked as a callback to its BFGS implementation. Note that this is done on Spark's driver and the loss and averaged gradient are calculated there.

Here, "we cannot write down the actual mathematical formula for the function we’re optimizing. (The technical term for the function that is being optimized is response surface.)" [1]

The format of the broadcast weights confused me at first as they're a one dimensional vector when I was expecting matrices (see Neural Nets are just Matrices). On closer inspection I saw that this vector contained all the values in all the matrices for all the layers and was sliced and diced accordingly.

Hyperparameter Optimization via Randomisation

One level up, a meta-optimization if you like, we want to tune the parameters for the neural nets (choice of activation function, size and cardinality of layers etc). Spark only offers ParamGridBuilder out of the box. This is really just a simple class that creates a Cartesian product of possible parameters. This is fed to CrossValidator.setEstimatorParamMaps as a simple Array so the cross validator can explore a parameter space by brute force.
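In PySpark this brute force looks something like the sketch below, with LogisticRegression standing in for whatever estimator is being tuned; lr, train_df and the column names are assumptions.

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression(featuresCol="features", labelCol="label")
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.0, 0.01, 0.1])
        .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
        .build())                      # 3 x 3 = 9 combinations, tried exhaustively

cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(labelCol="label"),
                    numFolds=3)
model = cv.fit(train_df)               # train_df is assumed to exist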

There is a more clever way that involves a random search. "for any distribution over a sample space with a finite maximum, the maximum of 60 random observations lies within the top 5% of the true maximum, with 95% probability" [1]

It seems odd but it's true. Let's look at the probability of not being in the top 5%. And let's call this top 5% space, X. Then:

P(x ∈ X) = 0.05

so, the probability of a random point not being in the top 5% is:

P(¬x ∈ X) = 1 - 0.05

Now, given N points, the probability of them all not being in the top 5% is:

P(¬xi ∈ X ∀i) = (1 - 0.05)^N

So, the probability that at least one of the N points is in the top 5% is one minus this.

Say, we want 95% probability we're in the top 5%, solve this equation and you'll see N is slightly less than 60. QED.
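A two-line check of that claim:

import math

print(math.log(0.05) / math.log(0.95))   # ~58.4, so 59 random samples suffice
print(1 - 0.95 ** 60)                    # ~0.954: probability of hitting the top 5% with 60 samples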

Exploration of Space

The state space of parameters can be explored for maxima or minima via various techniques. There are more intelligent ways of doing it but one way is using Monte Carlo techniques. Here in my GitHub account is a Python/Theano/PyMC3 example using Metropolis-Hastings to sample from the beautifully named Himmelblau function. This function is nice as we know analytically that it happens to have four local minima. But let's pretend we don't know that:

The Himmelblau function (top) and finding its minima (bottom)

The path to explore the space starts in a random place but quickly finds one of the local minima.

We might be lucky and the algorithm (non-deterministically) jumps out of its local minimum and finds another:
Finding two (equal) local minima.

But Metropolis-Hastings generally needs a lot of samples to find all local minima (in the diagram above, it was 30k and we only found two of the four). But it can at least find some local minimum quite easily in as few as 60 steps:

Local minima (3.0, 2.0) found in 60 steps

The MH algorithm adapts with each step whereas the Spark cross validation API expects the parameter co-ordinates to be fed to it a priori. If Spark is to have a more intelligent search space agent, it won't be via this class.

[1] Alice Zheng on O'Reilly

Sunday, January 24, 2021

Feature Engineering Time Series


"The loss of information that I mentioned above could actually be a useful feature of the Fourier Transformation. Let me explain. Your timeseries probably has some noise in it. One very useful thing that you can do in your feature engineering process would be to take the Fourier Transformation of your time-series. It will probably look pretty chaotic when you do that, because there is some noise built into the signal. So now that you are in the frequency-domain, you can just threshold your transformed data-set. You can do this by simply setting an arbitrary threshold say any frequency with an amplitude greater than 10, any amplitude less than 10, you set to zero. And then you can transform the series back to the time-domain. What you will find when you do this is that your time-series has been simplified. There is less noise. Essentially, what you have done is you have applied a filter to your data. You have filtered out the components of the signal that are likely noise." - Ryan Barnes

Some Fourier transform examples

With code stolen from here [SO] we can see an example:

"A property of the Fourier Transform which is used, for example, for the removal of additive noise, is its distributivity over addition." [1]

When you look at the code to generate this, you'll see it mention it will try to "remove DC". In this context, DC means the average (that is, frequency 0) values. This is a historical legacy of electrical engineering where DC means the Direct Current rather than the waves that make AC electricity.

Now, let's try a 2D representation where I'm actually trying to simulate human behaviour. Let's say it's people going to Church or Synagogue one day per week. We can do this "because the Fourier Transform is separable ... Using these two formulas, the spatial domain image is first transformed into an intermediate image using N one-dimensional Fourier Transforms. This intermediate image is then transformed into the final image, again using N one-dimensional Fourier Transforms." [1]

Can Fourier transforms pick it out of the data? (Code is on my GitHub repository):
Fig 2: Regular activity. Time is the y-axis in the top graphic. The second is frequency space. The third, amplitudes per column.

The first thing to notice is that instead of our spectrum having a peak on either side of the origin at the fundamental frequency, there are many peaks at multiples of the frequency. These are the harmonics. The reason Figure 1 does not have them is that the signal is made of two pure sinusoidal waves. There are no other frequencies at play.

Now, let's add some noise.
Regular activity with noise
Our regular periodicity is still the most prominent signal in frequency space and the lowest harmonic does indeed correspond to a regular event - in this case, it's something that happens every 8 days and the lowest harmonic is indeed at 0.125.

Worked example

So much for the theory, let's examine the practise. I note that my /var/log/syslog has lots of noise in it. I want to de-noise it to see if there is something suspicious in it. I used this code in my GitHub repository to spot irregular activity. Note, this is a very simple implementation that just looks at the process that logged an event and its time. I'll make it more sophisticated later.

Anyway, running the parsed log through a Fourier transform resulted in this:

A Fourier transform of syslog events

We can see that there are clearly some regulars so, taking them out, we see that the three (arbitrary number alert) most irregular events came from anacron, systemd and nvidia-persistenced. Looking at these logs in syslog did indeed indicate that they were isolated events, albeit nothing to worry about (dockerd and containerd were the most regular naggers).


Friday, January 15, 2021

Neural Nets and Anomaly Detection pt 1


More playing with anomaly detection, this time with Keras. All the data is synthetic and can be generated from the code on my GitHub here.

The Data

Our data looks like this:

Typical synthetic data

The top row shows noisy data that have a clear global periodicity. The bottom row shows noisy data with each column having a periodicity independent of the other columns.

Can a Variational Auto Encoder tell the difference?

The Model

The code for our neural net was adapted from this StackOverflow answer here. Very quickly (roughly 10 epochs), the VAE was able to differentiate which group a given square was in with an accuracy of about 90%.

The data projected onto the VAE's bottleneck

Note that I am not going to tune the VAE to get the last drop of accuracy. I'm happy with about 90%.

Note that the trick with VAEs is that the data with which we train it is also the data that we expect. This is simply because we're asking the neural net to reproduce what it's been given. It won't be exact at all since we're deliberately compressing the data through a bottleneck. This is the very definition of an auto encoder. The 'variational' term comes from the fact that the neural net learns the probabilities of the distribution representing the data [Quora].

The Results

The trouble with VAEs is that although you might be able to print a pretty graph of the rendered data, you might need some algorithm to differentiate all the data for you. I'm going to use SciKitLearn to run a KMeans algorithm since I know there are exactly 2 groups (that is, k=2 here).

This is fine in this example where I have two groups of 256 elements. What if we're faced with imbalanced data and are looking for outliers? KMeans does not do so well with that. It doggedly tries to find two clusters of roughly equal size giving an accuracy of about 50% - the monkey score.

Spot the outlier

To this end, we can use the DBScan algorithm, again from SKLearn. This finds my outlier but does take a bit of tuning.
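Roughly what that comparison looks like in scikit-learn, on a made-up imbalanced set (a dense blob plus one outlier; the eps/min_samples values are the sort of tuning mentioned above):

import numpy as np
from sklearn.cluster import KMeans, DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(256, 2)),   # dense blob
               [[8.0, 8.0]]])                        # single outlier

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

print(np.bincount(kmeans_labels))        # KMeans must use exactly two clusters
print((dbscan_labels == -1).sum())       # DBSCAN marks the outlier as noise (-1)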

Still playing...


Wednesday, January 13, 2021

Hypothesis Testing Notes

Recipe

A hypothesis test is only a partial answer to the question "am I being fooled by randomness?". It does not address sampling bias nor measurement errors both of which might be responsible for the apparent effect. [1]

The way to put a hypothesis test together is:

1. Define a test statistic.
2. Define a null hypothesis.
3. Ensure the model can generate synthetic data.
4. Calculate the p-value by generating synthetic data (step 3) under the null hypothesis and seeing how often the test statistic is at least as extreme as the one observed.

"All hypothesis tests fit into this framework." [2]

p-value

"Ideally we should compute the probability of seeing the effect (E) under both hypotheses; that is P(E | H0) and P(E | HA).  But formulating HA is not always easy, so in conventional hypothesis testing, we just compute P(E | H0), which is the p-value" [3]

Given our recipe above, this is how we'd calculate the p-value of a (potentially) crooked die.

1. Run 1000 simulations
2. Find the chi-squared value for each
3. Count the number above a threshold defined by the chi-squared value for the (expected, actual) tuple.
4. Divide by the 1000 trials and you have the p-value
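A hedged sketch of that recipe; the observed counts are invented to look like a die that favours sixes.

import numpy as np

observed = np.array([18, 21, 19, 20, 22, 40])     # suspiciously many sixes
n = observed.sum()
expected = np.full(6, n / 6)

def chi_squared(counts):
    return ((counts - expected) ** 2 / expected).sum()

rng = np.random.default_rng(0)
threshold = chi_squared(observed)
simulated = [chi_squared(np.bincount(rng.integers(0, 6, n), minlength=6))
             for _ in range(1000)]
p_value = np.mean(np.array(simulated) >= threshold)
print(p_value)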

Note that "the p-value depends on the choice of test statistic and the model of the null hypothesis, and sometimes these choices determine whether an effect is statistically significant or not." [4] "If different models yield very different results, that's a useful warning that the results are open to interpretation." [2]

Worked Example

Problem: In ThinkStats2, the assertion is made that first babies arrive earlier than subsequent babies. Given pregnancy duration for 4413 first babies and 4735 subsequent births, there is a 0.078 week difference between the two means. Is this significant?

Methodology: we repeatedly sample pairs of births and measure their difference. After N samples, we calculate our mean difference.

With the null hypothesis, we assume there is no effect and sample from all data without distinguishing between the two labels. We do this for N iterations by shuffling the original data and assigning each datum to one of two groups. Then, we take the difference in the means of those two groups. After N samples, we calculate our mean difference.
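In code, that permutation test is roughly the following, assuming firsts and others are NumPy arrays of pregnancy lengths (names are mine):

import numpy as np

def permutation_p_value(firsts, others, iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    observed = abs(firsts.mean() - others.mean())
    pooled = np.concatenate([firsts, others])
    count = 0
    for _ in range(iters):
        rng.shuffle(pooled)                                   # null hypothesis: labels don't matter
        group1, group2 = pooled[:len(firsts)], pooled[len(firsts):]
        if abs(group1.mean() - group2.mean()) >= observed:
            count += 1
    return count / iters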

Think Stats2 - Downey

Thursday, January 7, 2021

Kleislis


What are they?
Accordingly, the functions that we try to compose are actually A => Monad[B] . A wrapper around them, a category that is naturally associated with a Monad[B], is called a Kleisli category. A => Monad[B] or Kleisli arrows is just a way to compose these sort of functions, nothing more... 
Fun fact: If you compose Kleisli arrows for IO monad, you will get a description of your computer program. Your computer program is essentially one gigantic Kleisli arrow, with some input and output of Unit that acts as a description, and a runtime environment that executes this program works as an interpreter.
[Pavel Zaytsev on Medium.com]

Rob Norris @tpolecat Nov 20 20:39
Kleislis compose, the equivalent bare functions don't. It's just a matter of convenience.
Kleisli[F, A, B] and A => F[B] are isomorphic, pick whichever works better.
John D Cook has a blog post here on how the use of Kleislis could have saved the Mars lander from failing (TL;DR: the conversion from kilometers to miles was done twice).

Why are they useful?

They compose whereas monads (usually) don't.
If you know how to map over A[X] and over B[X] you also know how to map over A[B[X]]. Automagically. For free. 
This is untrue for Monad: knowing how to flatMap over A[X] and B[X] doesn’t grant you the power of magically deriving a flatMap for A[B[X]]
It turns out this is a well known fact: Monads do not compose, at least not generically.
[Blog of Gabriele Petronella]



From Rob Norris' talk, Functional Programming with Effects

[Rob's talk is here].

Using Haskell notation: "there is a way to compose functions like a -> m b and b -> m c into a -> m c (like the function here, from String to Set[String]). They are called Kleisli functions" [SO]

Ryan Peters [@sloshy on Gitter Jul 14 18:21 2020] describes why he likes Kleisli's for Dependency Injection:
damnit I keep stumbling into Kleisli by accident... 
My architecture as of late has looked something like this:
Config object defining how to read in config arguments/env (using Decline or Ciris)
Dependencies case class that has a constructor which depends on that Config and tries to read it in, converting those values to all of my dependencies.
Anything that needs different dependencies, like concurrent jobs or other processes, is defined as a Kleisli as mentioned.
Main loads a single Resource[IO, Dependencies] and converts all of my Kleisli jobs into Kleisli[IO, Dependencies, ExitCode] and runs it at once.
Kleisli is intimately related with "first do X then do Y" (sequencing) [Gavin Bisesi Gitter May 06 21:02, 2020]. Also, note that ReaderT == Kleisli [Adam Rosien Gitter May 06 19:28 2020]

Usage in HTTP4S

[Gabriel Volpe @gvolpe Jul 06 16:17 2020]
So, HttpRoutes is an alias for 

Kleisli[OptionT[F, *], Request[F], Response[F]]

We say 

def handle(routes: HttpRoutes[F]): HttpRoutes[F] = ...

So that's the type we are expecting to have (we are annotating our function [with] the expected type), meaning we don't need to do the same again within the Kleisli block because the types can be inferred. It would be a different story if we wouldn't have told the compiler what the expected type is.

Then we have the implementation:

Kleisli { req =>
      OptionT {
        A.handleErrorWith(routes.run(req).value)(e =>
          A.map(handler(e))(Option(_)))
      }
    }

Kleisli { ... } is the same as calling Kleisli.apply. So [within] this block, we expect the type to be 
Request[F] => OptionT[F, Response[F]]

Remember that req can be inferred by the compiler, so we know we have req: Request[F]

Then again we call OptionT { ... } which is the same as OptionT.apply. So [within] this block, we expect the type to be 
F[Option[Response]].

Now we get to the final block

A.handleErrorWith(routes.run(req).value)(e => 
   A.map(handler(e))(Option(_))
)
 
Let's recap on the types we have

routes:  HttpRoutes[F]
req:     Request[F]
handler: E => F[Response[F]]
A:       ApplicativeError[F, Throwable]

A good exercise would be to split every single part to understand what the types are

routes.run(req) effectively runs the Kleisli by feeding a Request[F], which gives us an OptionT[F, Response[F]]. So we call .value on it to get F[Option[Response[F]]

Now here's the type signature of handleErrorWith

def handleErrorWith[A](fa: F[A])(f: E => F[A]): F[A]

fa is F[Option[Response[F]] for us
so the second part should be 
E => F[Option[Response[F]] 
but notice that handler is defined as 
E => F[Response[F]] 
so we perform that final map(Option(_)) to lift that Response[F] into an Option[Response[F]].

A worked example using HTTP4S

I found this sample on GitHub informative. It presents an HTTP server using Http4s, a database from H2 (from a quick sbt 'show dependencyClasspathFiles' or sbt dependencyTree), Quill and even Flyway to update the database schema. All of this in less than 150 lines!

The whole app is basically this line:

  private val httpApp: Kleisli[IO, Request[IO], Response[IO]] = (videos <+> tags <+> videotags).orNotFound 

where videos, tags and videotags are all HttpRoutes, which is just an alias as defined above by Gabriel.

But what is this <+> magic? Well, we're combining Kleislis by treating their contents as semigroups. The results seemed a little counter-intuitive to me to begin with. Run this code:

  val add1      = Kleisli[List, Int, Int](x => List(x + 1))
  val divide10  = Kleisli[List, Int, Int](x => List(x / 10))

  val add1Div10 = add1 <+> divide10

  println(add1Div10.run(99))

and you'll see:

List(100, 9)

Straight forward so far. The call to Kleisli.run executed both functions on the input and used an implicit semigroup to glue the answers together within the List.

But HttpRoutes does not deal with Lists, it deals with OptionTs - a monad transformer for Options. How they behave is demonstrated in my GitHub project. Basically, the binary operator defers to OptionTSemigroupK and it looks like this:

def combineK[A](x: OptionT[F, A], y: OptionT[F, A]): OptionT[F, A] = x.orElse(y)

For example, if our F were List (that is, we're gluing together OptionT[List, Int]s) then calling <+> on them will result in a Kleisli whose value is a list with just the first Some (or a single None if they're both Nones). 

Note that if our F were IO then the effect associated with any but the first OptionT would not be run (as demonstrated here in my GitHub). In our HTTP app, the flow short circuits on the first combined function that gives a satisfactory answer.

But now we're getting away from Kleislis and into semigroups. Suffice to say, composing the Kleislis will glue the results together according to the laws of their semigroup.