Saturday, September 14, 2019

How much data is enough?


The glib answer is that you can never have enough. "Big data is often discussed along with machine learning... Some problems require big data, all the data you have. For example, simple statistical machine translation" (MachineLearningMastery).

But there is a rough guide:

Rule of 10


"The answer is that as a rule of thumb, you need roughly 10 times as many examples as there are degrees of freedom in your model." (FastML)

"The amount of training data you need for a well performing model is 10x the number of parameters in the model." (Dr Malay Haldar)

We're all Bayesians Now

Should we be satisfied with rules-of-thumb? Is there a more mathematical basis for our confidence?

In the Bayesian world, everything(ish) is a distribution. If we had a suspicion that a coin was biased because we'd seen 1 head and 4 tails, we could model this with a beta distribution, beta(2, 5). This would form our prior. (We calculate the hyperparameters of the beta distribution by adding 1 to our observations).

Why the beta distribution? “The Beta distribution is best for representing a probabilistic distribution of probabilities - the case where we don't know what a probability is in advance, but we have some reasonable guesses.” [StackOverflow]
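A minimal sketch of that prior, using SciPy (my choice of tooling, not something from the quoted sources):

from scipy import stats

# Prior: 1 head and 4 tails observed, so hyperparameters 1+1 = 2 and 4+1 = 5.
prior = stats.beta(2, 5)

print(prior.mean())       # 2 / (2 + 5), roughly 0.29 - we lean towards tails
print(prior.pdf(0.5))     # the prior density at "the coin is fair"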

If we continued to experiment, tossing the same coin 100 more times, and found it yielded 63 heads and 37 tails, we could model this with a beta binomial: a binomial likelihood for the 63 heads in 100 tosses combined with our beta prior. Our posterior would now be proportional to beta(2 + 63, 5 + 37) = beta(65, 42).
This is because of the idea of conjugate priors, which Pfeffer describes as "a prior distribution that works well with a distribution that depends on the parameter. Technically, this means that the posterior distribution has the same form as the prior distribution. Using this term, the beta distribution is the conjugate prior of the binomial, because the posterior distribution over the parameter is also a beta distribution. When you have a conjugate prior distribution, the integration in Bayes’ rule has an easy solution. This is why conjugate distributions are used frequently in Bayesian statistics. But when using probabilistic programming, you’re not limited to conjugate distributions." [PPP, Pfeffer]
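In code, the conjugate update is nothing more than adding the observed counts to the prior's hyperparameters. A sketch, again with SciPy:

from scipy import stats

heads, tails = 63, 37

# Conjugacy: add the counts to the prior's hyperparameters (2, 5) to get the posterior.
posterior = stats.beta(2 + heads, 5 + tails)    # beta(65, 42)

print(posterior.mean())   # our updated belief about the bias, roughly 0.61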

Wait! A beta what?

A beta binomial. This is a binomial distribution that you would have been taught in high school:

nCk p^k (1 - p)^(n-k)

multiplied by the probability described by a beta distribution's probability density function:

p^(α-1) (1 - p)^(β-1) / B(α, β)

Where B is the beta function - not to be confused with our beta distribution that uses it.

The beta function is:

B(α, β) = Γ(α)Γ(β) / Γ(α + β)

where, for a positive integer n, simply

Γ(n) = (n - 1)!

Anyway, we take the product of these two distributions. But we must normalize it, so we divide it by its own integral over p on [0, 1]. This last bit is a little fiddly, but you can look the integral up in a book and then, cancelling out, see the whole expression become another beta distribution whose hyperparameters are the original ones plus the observed counts.
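We can check this numerically. The sketch below (SciPy again; the helper function names are my own) builds the two densities from the formulas above, normalizes their product by integrating over p in [0, 1], and compares the result to the beta(65, 42) density:

from scipy.special import comb, gamma
from scipy.integrate import quad

def B(a, b):
    # the beta *function*, built from the gamma function as above
    return gamma(a) * gamma(b) / gamma(a + b)

def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def beta_pdf(p, a, b):
    return p**(a - 1) * (1 - p)**(b - 1) / B(a, b)

def unnormalized_posterior(p):
    # binomial likelihood of 63 heads in 100 tosses times the beta(2, 5) prior
    return binomial_pmf(63, 100, p) * beta_pdf(p, 2, 5)

# normalize by integrating over p in [0, 1]
normalizer, _ = quad(unnormalized_posterior, 0, 1)

# the two numbers below should agree: the normalized product is the beta(65, 42) density
print(unnormalized_posterior(0.6) / normalizer)
print(beta_pdf(0.6, 65, 42))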

Calculation

So, our confidence becomes more concentrated. Where exactly can be calculated easily, as the mode of a beta distribution has a simple formula, (α - 1) / (α + β - 2) (that's one reason why it was chosen for this example), and comes to 64/105 = 0.6095238095238096 (using 64-bit floats). Such a convenient closed form is not always available, so we could use a tool like PyMC to calculate it:

import pymc3 as pm

basic_model = pm.Model()

with basic_model:
    # our posterior: a beta distribution with hyperparameters 65 and 42
    alpha = pm.Beta('mybeta', alpha=65, beta=42)

# find the maximum a posteriori value, i.e. the mode of the distribution
map_estimate = pm.find_MAP(model=basic_model)

print(map_estimate['mybeta'])

Which produces 0.60952377 - a value not far from the analytically derived answer.
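It is also a one-liner to check against the analytic mode formula mentioned above:

alpha, beta = 65, 42
mode = (alpha - 1) / (alpha + beta - 2)   # 64 / 105

print(mode)   # 0.6095238095238096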

Interpretation

This example is taken from Pfeffer's Practical Probabilistic Programming. In it, "you’re trying to predict the toss of the 101st coin, based on the outcome of the first 100 tosses. You have three variables:

- the Bias of the coin
- the NumberOfHeads in the first 100 tosses
- the outcome of Toss 101.

"Bias is generated first, and then all of the coin tosses depend on Bias. If Bias is known, the coin tosses are independent. Adding independence information can significantly simplify the formula. Instead of including all previous variables on the right-hand side of the “|” for a given variable, you need to include only the variables it directly depends on in the dependency model

"Remember, when I say that Bias is generated first, I’m describing the generative process, not that Bias is known first. This is yet another example indicating that the order in which variables are generated isn’t necessarily the order of inference. In our example, Bias is generated first, but inference goes from NumberOfHeads to Bias... The direction of the arrows in the network isn’t necessarily the direction in which you expect to do inference."

Anyway, there are 3 ways to calculate the probability of the 101st coin toss, each with pros and cons.

Maximum a Posteriori

"The first step is to compute a posterior distribution over the Bias by using the approach of the previous section. In the next step, you compute the value of the Bias that’s the peak of beta(65, 42).

"In addition to being able to counter overfitting, the MAP method is also relatively efficient, because it doesn’t require integrating over all parameter values. But it does require specifying a prior, which can be difficult."

There's a simple formula for this (the mode again) that yields 0.6095, because we chose convenient distributions to model our problem. In the real world, we'd probably need something like PyMC or Figaro.
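In general, finding the MAP estimate means numerically maximizing the (unnormalized) log posterior. A sketch with SciPy, assuming the same prior and data as before:

from scipy import stats
from scipy.optimize import minimize_scalar

def negative_log_posterior(p):
    # log prior + log likelihood; the normalizing constant doesn't affect the peak
    return -(stats.beta(2, 5).logpdf(p) + stats.binom.logpmf(63, 100, p))

result = minimize_scalar(negative_log_posterior, bounds=(0.001, 0.999), method='bounded')

print(result.x)   # roughly 0.6095, the MAP estimate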

Maximum Likelihood Estimation

MLE is a "commonly used special case of the MAP estimation process [where] you choose the parameter values that fit the data the best, without regard to any prior... The MLE method provides the best fit to the data, but is also liable to overfit the data."

It gives an answer of 0.63 exactly (that is the observed 63 heads out of 100 tosses).
The theory behind MLE goes like this (see All of Statistics, Wasserman, p122). The likelihood function is defined as:

L_n(θ) = ∏_{i=1}^{n} f(X_i; θ)

There are some important things to note:
  1. It is a function of the parameter θ, not the data.
  2. "In general, it is not true that n(θ) integrates to 1 (with respect to θ)".
  3. The maximum of the log of n(θ) is at exactly the same place as the maximum of n(θ) itself and this makes the maths much easier.
Now take, say, a Bernoulli distribution, where:

f(x; p) = p^x (1 - p)^(1-x) for x ∈ {0, 1}

Plug this into the function for L_n(θ) above, noting that a product of exponentials is an exponential of a sum; take the log (point 3 above); differentiate with respect to p; set to zero; and solve for p. You'll find that the MLE is Σ_i X_i / n (i.e., the sample mean).
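The same answer can be recovered numerically by maximizing the log likelihood directly. Again, a sketch of my own using SciPy:

import numpy as np
from scipy.optimize import minimize_scalar

# the observed data: 63 heads (1) and 37 tails (0)
data = np.array([1] * 63 + [0] * 37)

def negative_log_likelihood(p):
    # minus the sum of log f(X_i; p) for the Bernoulli density above
    return -np.sum(data * np.log(p) + (1 - data) * np.log(1 - p))

result = minimize_scalar(negative_log_likelihood, bounds=(0.001, 0.999), method='bounded')

print(result.x, data.mean())   # both roughly 0.63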

Full Bayesian method

"Instead of estimating a single value of the Bias, it uses the full posterior distribution over Bias." This requires integrating over the formula and normalizing. This can be difficult in the real world but happens to be easy for our beta binomial which gives an answer of 0.6075.

The full Bayesian "can be superior to the other approaches, because it uses the full distribution. In particular, when the mode of the distribution isn’t representative of the full distribution, the other approaches can be misleading." In our example with the beta function, this is not a problem. But the real world might be very different.

