Saturday, September 28, 2019

Applicative Functors and Validation


Applicatives

Laws Applicative instances must follow:

map(apply(x))(f)         == apply(f(x))
join(apply(x), apply(y)) == apply((x, y))
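These laws are easy to sanity-check concretely. Here is a minimal sketch, with the caveat that the helper names are mine: `apply` is played by a Python `pure` that wraps a value in a one-element list (the list applicative):

```python
# Checking the two laws above with the list applicative:
# pure(x) wraps a value, fmap maps over it, product pairs two wrapped values.
def pure(x):
    return [x]

def fmap(fx, f):
    return [f(a) for a in fx]

def product(fx, fy):
    return [(a, b) for a in fx for b in fy]

f = lambda n: n + 1

# map(apply(x))(f) == apply(f(x))
assert fmap(pure(3), f) == pure(f(3))

# join(apply(x), apply(y)) == apply((x, y))
assert product(pure(3), pure(4)) == pure((3, 4))
```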

Recall that "functions are applicative functors" (LYAHFGG) and applicatives have a function of type:

f (a -> b) -> f a -> f b.

So, where is this function on scala.Function1 etc?

Basically, the authors of Scala didn't put that functional programming goodness into the standard library, so you have to roll your own or use a library like Cats (see below).

Applicative Functor

I liked the second definition here at StackOverflow which basically says the defining Functor function is:

  def map[A, B](f: A => B): C[A] => C[B]

"And it works perfectly for a function of one variable. But for a function of 2 and more, after lifting to a category, we have the following signature:"

val g = (x: Int) => (y: Int) => x + y

for example:

Option(5) map g // Option[Int => Int]

The defining Applicative function is:

  def apply[A, B](f: F[A => B]): F[A] => F[B]
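To see how apply finishes what map starts, here is a rough Python model of an Option-like type (None standing in for the empty case; the helper names are mine, not any library's):

```python
# fmap: C[A] => C[B]; apply: F[A => B] -> F[A] -> F[B], with None as "empty".
def fmap_opt(fa, f):
    return None if fa is None else f(fa)

def apply_opt(ff, fa):
    return None if ff is None or fa is None else ff(fa)

g = lambda x: lambda y: x + y

partial = fmap_opt(5, g)            # like `Option(5) map g`: a wrapped Int => Int
assert apply_opt(partial, 6) == 11  # apply feeds in the second argument
assert apply_opt(partial, None) is None
```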

"So why bother with applicative functors at all, when we've got monads? First of all, it's simply not possible to provide monad instances for some of the abstractions we want to work with—Validation is the perfect example. Second (and relatedly), it's just a solid development practice to use the least powerful abstraction that will get the job done. In principle this may allow optimizations that wouldn't otherwise be possible, but more importantly it makes the code we write more reusable"
[ibid]

Validation

The author of this answer raises a good example of where monads are not appropriate - something that struck me this week when I was writing validation code.

Monads are great. They're great because I don't need to change my code when my initial choice of type turns out to be wrong. For instance, I had some parsing code that looked like this:

  def parse: T[String] = for {
    a <- parseParent
    b <- parseElementA
    c <- parseElementB
  } yield {
    ...
  }

and naively my parseXXX functions return Options. The problem here is that if we fail to parse an element, None doesn't tell us why. No worries - let's use Either, which (since Scala 2.12) is a monad too! Now my parseXXX methods tell me why they fail, and I never had to change the above block of code!

The next problem occurred when our QA told me that he had to run the whole application again just to find the next error in the data he was feeding into it. In the cloud (GCP), this is a royal pain. So, wouldn't it be great to aggregate all the errors?

As mentioned in the SO answer above, it's simply not possible to do this with monads. Fortunately, "there's a simpler abstraction—called an applicative functor—that's in-between a functor and a monad and that provides all the machinery we need. Note that it's in-between in a formal sense." [SO]

Cats and Validation

Cats has a type to help here called Validated. But note that "what’s different about Validation is that it does not form a monad, but forms an applicative functor." [eed3si9n]

So, with a few imports from Cats, I can write the even more succinct:

  def doApplicatives: T[String] = (parseParent, parseElementA, parseElementB).mapN { (x, y, z) =>
    ...
  }

What's more, since Options and Eithers are also applicatives (as are all monads) this code still works just as well with them because "every monad is an applicative functor, every applicative functor is a functor, but not every applicative functor is a monad, etc." [SO]
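Stepping outside Scala for a moment, the error-accumulating behaviour that makes Validated applicative-but-not-monadic is simple to model. A hypothetical Python sketch (not the Cats API), where a result is either ("ok", value) or ("err", [messages]):

```python
# Applicative-style combination of two validation results: unlike a monad,
# which stops at the first failure, this accumulates every error message.
def combine(a, b):
    if a[0] == "err" and b[0] == "err":
        return ("err", a[1] + b[1])
    if a[0] == "err":
        return a
    if b[0] == "err":
        return b
    return ("ok", (a[1], b[1]))

both_bad = combine(("err", ["no parent"]), ("err", ["bad element A"]))
assert both_bad == ("err", ["no parent", "bad element A"])  # both errors kept
assert combine(("ok", 1), ("ok", 2)) == ("ok", (1, 2))
```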

Note that Scalaz also gives us an applicative functor instance for Option, so we can write the following:

import scalaz._, std.option._, syntax.apply._

def add(i: Int, j: Int): Int = i + j

val x: Option[Int] = ...
val y: Option[Int] = ...

val xy = (x |@| y)(add)

So, for validation, don't use monads, use applicatives and some library to add syntactic sugar.

Sunday, September 22, 2019

Git strategies


I've normally worked in a team where each developer has his own feature branch. My current team instead has each developer working on his own fork. The reason for this is that the main repository is not "polluted" with long-dead development.

In addition, squashing the commits (see below) means each feature only adds one commit to the main codebase.

Forking code

To fork one codebase (FORK_FROM) into another (FORK_TO):

git clone FORK_TO
git remote add ARBITRARY_NAME FORK_FROM
git fetch ARBITRARY_NAME
git checkout -b ARBITRARY_BRANCH_NAME
git merge --allow-unrelated-histories ARBITRARY_NAME/master
git push --set-upstream origin ARBITRARY_BRANCH_NAME

Origins and aliases

Git's remote labels are just that - aliases for URLs of repositories. To see them, run:

git remote -v

"Remotes are like nicknames for the URLs of repositories - origin is one, for example." [SO]

By convention, "upstream generally refers to the original repo that you have forked ... origin is your fork: your own repo on GitHub, clone of the original repo of GitHub" [SO]

You can delete remote labels with:

git remote rm upstream

Note that this cannot in itself damage your codebase.

Anyway, having forked from the main repository, you may want to add an alias like this:

git remote add upstream URL_OF_CODE_WE_FORKED_FROM

Briefly, "you have access to pull and push from origin, which will be your fork of the main ... repo. To pull in changes from this main repo, you add a remote, upstream in your local repo, pointing to this original and pull from it.

"So origin is a clone of your fork repo, from which you push and pull. Upstream is a name for the main repo, from where you pull and keep a clone of your fork updated, but you don't have push access to it.” [SO]

Merging from the main branch

You might first want to see what change you're pulling [SO] so do a:

git diff COMMIT_ID~ COMMIT_ID

This assumes the previous committer "squashed" their commits (see below). This keeps the main fork cleaner.

You can check out your personal main branch and run:

git pull upstream MAIN_BRANCH

or a:

git reset --hard upstream/MAIN_BRANCH

rather than a rebase.

You may sometimes have Git tell you that you are working on a detached head. What does this mean? “You are not on a branch (the commit is likely to be on multiple branches). You are checked out to a specific instance in the history… You are checked out to a specific commit.” [SO]

After doing this, your personal main branch has the team's changes and you can merge like usual.

If you really foul things up after pushing to your personal code base, fear not. You can rollback a commit with:

git revert THE_HASH_FOR_COMMIT_YOU_WANT_TO_FORGET

This will leave the commit in the log, though. See the line “less dangerous approach” here [SO].

Merging to the main branch

When pushing your code to the main branch, you can "squash" all the commits. This means that in the future, developers will just see the salient changes without all the commits that went into making that feature.

You can squash commits with:

git rebase -i MAIN_BRANCH

You'll be asked to edit a file. Keep the first line and replace 'pick' in subsequent lines with 's'. Then:

git push -f origin YOUR_BRANCH

Now you're good to merge with the team's codebase, although you may not have privileges to do that...

Note: you might want to squash your code before merging changes from another branch as this makes any conflicts much easier to resolve.

Miscellaneous

Some people use fetch and some use pull. What's the difference? “In the simplest terms, git pull does a git fetch followed by a git merge.

"You can do a git fetch at any time to update your remote-tracking branches under refs/remotes/<remote>/.

"This operation never changes any of your own local branches under refs/heads, and is safe to do without changing your working copy… A git pull is what you would do to bring a local branch up-to-date with its remote version, while also updating your other remote-tracking branches.” [SO]

Rolling back

How to roll back depends on what exactly you mean by rolling back (do you want a record of the abandoned commits or not? etc.). The simplest way is to just checkout [SO] like:

git checkout HASH_OF_DESIRED_COMMIT .

and note that '.' at the end of the line. You can then git push origin HEAD:YOUR_BRANCH

There are variants on this but this way will retain your history - which may or may not be what you desire. For instance, this line above will literally give you the code as it was at that point in time. If you want to re-apply any changes from other branches that you had previously pulled, Git will think this has already been applied and do nothing. Similarly, if you pull this branch into another that already has those changes, they will not be deleted just because the branch you're pulling from no longer has them. You need to revert.

Merges also cause a problem for revert. This is because, when rolling back each commit, Git does not know by default which branch to continue rolling back when it encounters a merge. You have to run something like git revert -m 1 YOUR_HASH..HEAD to revert. The -m 1 tells Git which parent to follow.

Saturday, September 14, 2019

How much data is enough?


The glib answer is you can never have enough. "Big data is often discussed along with machine learning... Some problems require big data, all the data you have. For example, simple statistical machine translation" (MachineLearningMastery).

But there is a rough guide:

Rule of 10


"The answer is that as a rule of thumb, you need roughly 10 times as many examples as there are degrees of freedom in your model." (FastML)

"The amount of training data you need for a well performing model is 10x the number of parameters in the model." (Dr Malay Haldar)

We're all Bayesians Now

Should we be satisfied with rules-of-thumb? Is there a more mathematical basis for our confidence?

In the Bayesian world, everything(ish) is a distribution. If we had a suspicion that a coin was biased because we'd seen 1 head and 4 tails, we could model this with a beta distribution, beta(2, 5). This would form our prior. (We calculate the hyperparameters of the beta distribution by adding 1 to our observations.)

Why the beta distribution? “The Beta distribution is best for representing a probabilistic distribution of probabilities - the case where we don't know what a probability is in advance, but we have some reasonable guesses.” [StackOverflow]

If we continued the experiment for another 100 tosses and found the same coin yielded 63 heads and 37 tails, we could model this with a beta binomial (63 heads, 37 tails). Our posterior would now be proportional to beta(65, 42).

This is because of the idea of conjugate priors, a term used "to describe a prior distribution that works well with a distribution that depends on the parameter. Technically, this means that the posterior distribution has the same form as the prior distribution. Using this term, the beta distribution is the conjugate prior of the binomial, because the posterior distribution over the parameter is also a beta distribution. When you have a conjugate prior distribution, the integration in Bayes’ rule has an easy solution. This is why conjugate distributions are used frequently in Bayesian statistics. But when using probabilistic programming, you’re not limited to conjugate distributions." [PPP, Pfeffer]
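The update itself is just addition of counts. A quick check of the numbers above (prior from 1 head / 4 tails, then 63 heads / 37 tails observed):

```python
# Conjugate beta-binomial update: posterior hyperparameters are
# prior hyperparameters plus observed counts.
prior_alpha, prior_beta = 1 + 1, 4 + 1   # heads + 1, tails + 1 -> beta(2, 5)
heads, tails = 63, 37
post_alpha = prior_alpha + heads
post_beta = prior_beta + tails
assert (post_alpha, post_beta) == (65, 42)
```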

Wait! A beta what?

A beta binomial. This is a binomial distribution that you would have been taught in high school:

nCk p^k (1-p)^(n-k)

multiplied by the probability described by a beta distribution's probability density function:

p^(α-1) (1-p)^(β-1) / B(α, β)

Where B is the beta function - not to be confused with our beta distribution that uses it.

The beta function is:

B(α, β) = Γ(α)Γ(β) / Γ(α + β)

where, for positive integers n, simply

Γ(n) = (n - 1)!

Anyway, we take the product of these two distributions. But we must normalize it, so we divide it by itself integrated over p on [0, 1], as seen here. This last bit is a bit hard, but you can look the integral up in a book and then, cancelling out, see the whole equation become another beta distribution with the original hyperparameters plus the observations.
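Written out (in my notation, with k heads in n tosses), the cancellation looks like this:

```latex
\text{posterior}(p)
  \propto \binom{n}{k} p^{k} (1-p)^{n-k}
          \cdot \frac{p^{\alpha-1}(1-p)^{\beta-1}}{B(\alpha,\beta)}
  \propto p^{(\alpha+k)-1} (1-p)^{(\beta+n-k)-1}
```

which is the kernel of beta(α + k, β + n - k); with α = 2, β = 5, k = 63, n = 100 this gives beta(65, 42).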

Calculation

So, our confidence becomes more concentrated. Where exactly can be calculated easily, as the mode of a beta distribution has a simple formula (that's one reason why it was chosen for this example), and comes to 0.6095238095238096 (using 64-bit floats). This is not always the case, so we could use a tool like PyMC to calculate it:

import pymc3 as pm

basic_model = pm.Model()

with basic_model:
    alpha = pm.Beta('mybeta', 65, 42)

map_estimate = pm.find_MAP(model=basic_model)

print(map_estimate['mybeta']) 

Which produces 0.60952377 - a value not far from the analytically derived answer.
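As a cross-check of that figure, the mode of beta(α, β) has the closed form (α - 1)/(α + β - 2):

```python
# Analytic mode of the beta(65, 42) posterior: (alpha - 1) / (alpha + beta - 2).
alpha, beta = 65, 42
mode = (alpha - 1) / (alpha + beta - 2)   # 64 / 105
assert abs(mode - 64 / 105) < 1e-12
```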

Interpretation

This example is taken from Pfeffer's Practical Probabilistic Programming. In it, "you’re trying to predict the toss of the 101st coin, based on the outcome of the first 100 tosses. You have three variables:

- the Bias of the coin
- the NumberOfHeads in the first 100 tosses
- the outcome of Toss 101.

"Bias is generated first, and then all of the coin tosses depend on Bias. If Bias is known, the coin tosses are independent. Adding independence information can significantly simplify the formula. Instead of including all previous variables on the right-hand side of the “|” for a given variable, you need to include only the variables it directly depends on in the dependency model.

"Remember, when I say that Bias is generated first, I’m describing the generative process, not that Bias is known first. This is yet another example indicating that the order in which variables are generated isn’t necessarily the order of inference. In our example, Bias is generated first, but inference goes from NumberOfHeads to Bias... The direction of the arrows in the network isn’t necessarily the direction in which you expect to do inference."

Anyway, there are 3 ways to calculate the probability of the 101st coin toss, each with pros and cons.

Maximum a Posteriori

"The first step is to compute a posterior distribution over the Bias by using the approach of the previous section. In the next step, you compute the value of the Bias that’s the peak of beta(65, 42).

"In addition to being able to counter overfitting, the MAP method is also relatively efficient, because it doesn’t require integrating over all parameter values. But it does require specifying a prior, which can be difficult."

There's a simple equation for this that yields 0.6095 because we chose convenient equations to model our problem. In the real world, we'd probably need something like PyMC or Figaro.

Maximum Likelihood Estimation

MLE is a "commonly used special case of the MAP estimation process [where] you choose the parameter values that fit the data the best, without regard to any prior... The MLE method provides the best fit to the data, but is also liable to overfit the data."

It gives an answer of 0.63 exactly (that is, the observed 63 heads out of 100 tosses).

The theory behind MLE goes like this (see All of Statistics, Wasserman, p122). The likelihood function is defined as:

L_n(θ) = ∏_{i=1}^{n} f(X_i; θ)

There are some important things to note:
  1. It is a function of the parameter θ, not the data.
  2. "In general, it is not true that L_n(θ) integrates to 1 (with respect to θ)".
  3. The maximum of the log of L_n(θ) is at exactly the same place as the maximum of L_n(θ) itself, and this makes the maths much easier.
Now take, say, a Bernoulli distribution, where:

f(x; p) = p^x (1-p)^(1-x) for x ∈ {0, 1}

Plug this into the function for L_n(θ) above, noting that a product of exponentials is an exponential of a sum; differentiate with respect to θ; set to zero; and solve for p. You'll find that the MLE is Σ_i X_i / n (i.e., the mean).
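A numeric sanity check of that result - maximise the Bernoulli log-likelihood for 63 heads in 100 tosses over a grid of p and confirm the argmax is the sample mean:

```python
import math

# Log-likelihood of 63 heads / 37 tails under Bernoulli(p), maximised by grid search.
heads, n = 63, 100
best_p, best_ll = None, float("-inf")
for i in range(1, 1000):                      # p in (0, 1), step 0.001
    p = i / 1000
    ll = heads * math.log(p) + (n - heads) * math.log(1 - p)
    if ll > best_ll:
        best_p, best_ll = p, ll
assert abs(best_p - heads / n) < 1e-9         # MLE is the sample mean, 0.63
```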

Full Bayesian method

"Instead of estimating a single value of the Bias, it uses the full posterior distribution over Bias." This requires integrating over the formula and normalizing. This can be difficult in the real world but happens to be easy for our beta binomial which gives an answer of 0.6075.
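For our conjugate example, the "integrate and normalize" step collapses to the posterior mean of beta(65, 42), which is α / (α + β):

```python
# Full-Bayesian predictive probability of heads = posterior mean of beta(65, 42).
alpha, beta = 65, 42
predictive = alpha / (alpha + beta)   # 65 / 107
assert round(predictive, 4) == 0.6075
```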

The full Bayesian "can be superior to the other approaches, because it uses the full distribution. In particular, when the mode of the distribution isn’t representative of the full distribution, the other approaches can be misleading." In our example with the beta function, this is not a problem. But the real world might be very different.