The glib answer is that you can never have enough. "Big data is often discussed along with machine learning... Some problems require big data, all the data you have. For example, simple statistical machine translation" (MachineLearningMastery).
But there is a rough guide:
Rule of 10
"The answer is that as a rule of thumb, you need roughly 10 times as many examples as there are degrees of freedom in your model." (
FastML)
"The amount of training data you need for a well performing model is 10x the number of parameters in the model." (Dr
Malay Haldar)
We're all Bayesians Now
Should we be satisfied with rules-of-thumb? Is there a more mathematical basis for our confidence?
In the Bayesian world, everything(ish) is a distribution. If we had a suspicion that a coin was biased because we'd seen 1 head and 4 tails, we could model this with a beta distribution, beta(2, 5). This would form our prior. (We calculate the hyperparameters of the beta distribution by adding 1 to our observations.)
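As a quick sketch of what that prior looks like (my own addition, using scipy, which the post doesn't otherwise use):

from scipy.stats import beta

# 1 head and 4 tails observed; add 1 to each count for the hyperparameters.
prior = beta(a=2, b=5)

print(prior.mean())          # 2 / (2 + 5) ≈ 0.286
print(prior.interval(0.95))  # a wide interval - we're still very uncertain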
Why the beta distribution? “The Beta distribution is best for representing a probabilistic distribution of probabilities - the case where we don't know what a probability is in advance, but we have some reasonable guesses.” [StackOverflow]
If we continued to experiment, tossing the same coin 100 times and finding that it yielded 63 heads and 37 tails, we could model this with a beta binomial(63, 37). Our posterior would now be proportional to beta(65, 42).
This is because of the idea of conjugate priors, a term used "to describe a prior distribution that works well with a distribution that depends on the parameter. Technically, this means that the posterior distribution has the same form as the prior distribution. Using this term, the beta distribution is the conjugate prior of the binomial, because the posterior distribution over the parameter is also a beta distribution. When you have a conjugate prior distribution, the integration in Bayes’ rule has an easy solution. This is why conjugate distributions are used frequently in Bayesian statistics. But when using probabilistic programming, you’re not limited to conjugate distributions." [PPP, Pfeffer].
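The conjugate update itself is just arithmetic on the hyperparameters. A minimal sketch (mine, not from the book or this post):

# Beta prior + binomial observations -> beta posterior (conjugacy).
prior_alpha, prior_beta = 2, 5     # from 1 head and 4 tails, each plus 1
heads, tails = 63, 37              # the next 100 tosses

posterior_alpha = prior_alpha + heads  # 65
posterior_beta = prior_beta + tails    # 42

print(posterior_alpha, posterior_beta)  # 65 42 - another beta distribution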
Wait! A beta what?
A beta binomial. This is a binomial distribution that you would have been taught in high school:

nCk p^k (1-p)^(n-k)

multiplied by the probability described by a beta distribution's probability density function:

p^(α-1) (1-p)^(β-1) / B(α, β)
Where B is the beta function - not to be confused with our beta distribution that uses it. The beta function is:

B(α, β) = Γ(α)Γ(β) / Γ(α + β)
where, for positive integers n, simply Γ(n) = (n - 1)!
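A quick numerical check of that identity (my addition, again using scipy):

from scipy.special import beta as beta_fn, gamma

a, b = 2, 5
print(beta_fn(a, b))                       # 0.0333... = 1/30
print(gamma(a) * gamma(b) / gamma(a + b))  # the same value: 1! * 4! / 6!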
Anyway, we take the product of these two distributions. But we must normalize it, so we divide it by itself integrated over p between [0, 1], as seen here. This last bit is a bit hard, but you can look the integral up in a book and then, cancelling out, see the whole equation become another beta distribution with the original hyperparameters plus the observations.
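A sketch of that working in LaTeX (my own, following the formulas above, with k = 63 heads in n = 100 tosses and the beta(2, 5) prior):

P(p \mid k) \propto \binom{n}{k} p^{k}(1-p)^{n-k} \cdot \frac{p^{\alpha-1}(1-p)^{\beta-1}}{B(\alpha, \beta)}
            \propto p^{k+\alpha-1}(1-p)^{n-k+\beta-1}

% Normalizing over p in [0, 1] turns the proportionality into an equality:
P(p \mid k) = \frac{p^{k+\alpha-1}(1-p)^{n-k+\beta-1}}{B(k+\alpha,\, n-k+\beta)}

% With \alpha = 2, \beta = 5, k = 63, n = 100, this is beta(65, 42).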
Calculation
So, our confidence becomes more concentrated. Where exactly can be calculated easily, as the mode of a beta distribution has a simple formula (that's one reason why it was chosen for this example) and comes to 0.6095238095238096 (using 64-bit floats). This is not always the case, so we could use a tool like PyMC to calculate it:
import pymc3 as pm

basic_model = pm.Model()
with basic_model:
    # The posterior over the coin's bias: beta(2 + 63, 5 + 37) = beta(65, 42).
    alpha = pm.Beta('mybeta', 65, 42)
# find_MAP returns the mode of the posterior, i.e. the MAP estimate.
map_estimate = pm.find_MAP(model=basic_model)
print(map_estimate['mybeta'])
Which produces 0.60952377 - a value not far from the analytically derived answer.
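For comparison, the analytic answer is just the beta mode formula, (α - 1) / (α + β - 2). A one-line check (my own):

# Mode of beta(65, 42): (alpha - 1) / (alpha + beta - 2) = 64/105
a, b = 65, 42
print((a - 1) / (a + b - 2))  # ≈ 0.6095238095...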
Interpretation
This example is taken from Pfeffer's Practical Probabilistic Programming. In it, "you’re trying to predict the toss of the 101st coin, based on the outcome of the first 100 tosses. You have three variables:
- the Bias of the coin
- the NumberOfHeads in the first 100 tosses
- the outcome of Toss 101.
"Bias is generated first, and then all of the coin tosses depend on Bias. If Bias is known, the coin tosses are independent. Adding independence information can significantly simplify the formula. Instead of including all previous variables on the right-hand side of the “|” for a given variable, you need to include only the variables it directly depends on in the dependency model
"Remember, when I say that Bias is generated first, I’m describing the generative process, not that Bias is known first. This is yet another example indicating that the order in which variables are generated isn’t necessarily the order of inference. In our example, Bias is generated first, but inference goes from NumberOfHeads to Bias... The direction of the arrows in the network isn’t necessarily the direction in which you expect to do inference."
Anyway, there are 3 ways to calculate the probability of the 101st coin toss, each with pros and cons.
Maximum a Posteriori
"The first step is to compute a posterior distribution over the Bias by using the approach of the previous section. In the next step, you compute the value of the Bias that’s the peak of
beta(65, 42).
"In addition to being able to
counter overfitting, the MAP method is also relatively efficient, because it doesn’t require integrating over all parameter values. But it does require specifying a prior, which can be difficult."
There's a simple equation for this that yields 0.6095, but only because we chose convenient distributions to model our problem. In the real world, we'd probably need something like PyMC or Figaro.
Maximum Likelihood Estimation
MLE is a "commonly used special case of the MAP estimation process [where] you choose the parameter values that fit the data the best, without regard to any prior... The MLE method provides the best fit to the data, but is also liable to overfit the data."
It gives an answer of 0.63 exactly (that is the observed 63 heads out of 100 tosses).
The theory behind MLE goes like this (see All of Statistics, Wasserman, p122). The likelihood function is defined as ℒn(θ) = Πi f(Xi; θ), the joint density of the data viewed as a function of θ. Note that:
- It is a function of the parameter θ, not the data.
- "In general, it is not true that ℒn(θ) integrates to 1 (with respect to θ)".
- The maximum of the log of ℒn(θ) is at exactly the same place as the maximum of ℒn(θ) itself and this makes the maths much easier.
Now take, say, a Bernoulli distribution, where:

f(x; p) = p^x (1-p)^(1-x) for x ∈ {0, 1}

Plug this into the function for ℒn(θ) above (here θ is p), noting that a product of exponentials is an exponential of a sum; differentiate with respect to p; set to zero; and solve for p. You'll find that the MLE is ΣiXi/n (i.e., the sample mean).
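Written out in LaTeX (my own working of the steps just described):

\mathcal{L}_n(p) = \prod_{i=1}^{n} p^{X_i}(1-p)^{1-X_i} = p^{S}(1-p)^{n-S}, \quad S = \textstyle\sum_i X_i

\log \mathcal{L}_n(p) = S \log p + (n - S)\log(1 - p)

\frac{d}{dp}\log \mathcal{L}_n(p) = \frac{S}{p} - \frac{n - S}{1 - p} = 0 \;\Rightarrow\; \hat{p} = \frac{S}{n} = \frac{63}{100} = 0.63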
Full Bayesian method
"Instead of estimating a single value of the Bias, it uses the full posterior distribution over Bias." This requires integrating over the formula and normalizing. This can be difficult in the real world but happens to be easy for our beta binomial which gives an answer of 0.6075.
The full Bayesian "can be superior to the other approaches, because it uses the full distribution. In particular, when the mode of the distribution isn’t representative of the full distribution, the other approaches can be misleading." In our example with the beta function, this is not a problem. But the real world might be very different.