Thursday, July 12, 2018

Are you better?

There is a certain amount of randomness in neural nets. Train the same net twice and you're unlikely to get exactly the same answer. So, when making changes, how do we know things have improved?

I ran exactly the same Spark neural net code on slightly different data. The data represented the same use case but in one run I used L1 normalization and in the other L2. For each normalization technique, I ran the neural net 9 times. The accuracy looked like this:
 Accuracy of a neural net. Red is when using data with L2 normalization, blue with L1.
At first blush, it looks like L1 is better, but it's not certain. After all, some of the high values for L2 accuracy are higher than the low values for L1. And since each normalization has only 9 data points, could this apparent difference just be due to luck?

      Accuracy (%)   Std. Dev. (%)
L1    94.9           0.36
L2    94.3           0.42

The Classical Approach

The standard deviation in a sample can be used to estimate the standard deviation of the whole population. The standard deviation of the mean is σ/√N (aka the standard error) where σ is the (true) standard deviation of the population and N is the number of measurements per trial [Boaz p772].

To demonstrate, let's take 5 observations 1 million times. We'll compare the calculated value with the results of a simulation in Python:

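import numpy as np

N = 5              # observations per sample
Nsamp = 10 ** 6    # number of samples
sigma_x = 2        # true standard deviation of the population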

x = np.random.normal(0, sigma_x, size=(Nsamp, N))

mu_samp = x.mean(1)  # 10 ** 6 values that are the mean of each row
sig_samp = sigma_x * N ** -0.5  # theoretical standard error, sigma / sqrt(N)

print("{0:.3f} should equal {1:.3f}".format(np.std(mu_samp, ddof=1), sig_samp))

This outputs:

0.894 should equal 0.894

(Code stolen from here).
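Applied to the accuracy table above, the same σ/√N relationship gives a rough standard error for each mean accuracy. This is only a sketch: it assumes the sample standard deviations in the table are reasonable stand-ins for the true σ, which is a loose assumption with just 9 runs each.

```python
import math

# Values taken from the accuracy table above (percentages, 9 runs each).
n_runs = 9
std_dev = {"L1": 0.36, "L2": 0.42}

# Standard error of the mean: sample standard deviation / sqrt(N).
std_err = {name: s / math.sqrt(n_runs) for name, s in std_dev.items()}

for name, se in std_err.items():
    print("{0}: standard error = {1:.2f}%".format(name, se))
```

That is roughly 0.12% for L1 and 0.14% for L2, small compared with the 0.6% gap between the two means in the table.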

Using this relationship between the standard deviation of the samples and the standard error of the mean, we can then conduct a t-test.

"Using a t-test allows you to test
• whether the mean of a sample differs significantly from an expected value, or
• whether the means of two groups differ significantly from an expected value, or
• whether the means of two groups differ significantly from each other."
"[William 'Student'] Gosset's key observation was the dependence on sample size for determining the probability that the mean of the population lies within a given distance of the mean of the sample, if a normal distribution is assumed... Gosset noted that when samples are collected from a normal distribution, and if the number of samples is small, and these are used to estimate the variance, then the distribution (for the variable x):

t = (x̄ - μ) / (σ/√N)

is both flatter, and has more observations appearing in the tails, than a normal distribution, when the samples sizes are less than 30... This distribution is known as the t distribution and approximates a normal distribution if n (and by implication [degrees of freedom]) are large (greater than 30 in practical terms)." [1]

All things being equal, "most of the time, you’d expect to get t-values close to 0. That makes sense, right? Because if you randomly select representative samples from a population, the mean of most of those random samples from the population should be close to the overall population mean, making their differences (and thus the calculated t-values) close to 0... The t-value measures the size of the difference relative to the variation in your sample data... T is simply the calculated difference represented in units of standard error" [from a blog here].
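To make "difference in units of standard error" concrete, here is a minimal sketch of the two-sample (Welch) t-value worked out by hand and checked against SciPy. The accuracy values are hypothetical, invented for illustration; they are not the runs from this post, and `welch_t` is just a helper name chosen for this example.

```python
import numpy as np
from scipy import stats

# Hypothetical accuracies (%), invented for illustration only.
a = np.array([94.6, 95.1, 94.9, 95.3, 94.4, 95.0, 94.8, 95.2, 94.7])
b = np.array([94.1, 94.6, 93.9, 94.8, 94.3, 94.0, 94.5, 94.9, 93.8])


def welch_t(x, y):
    """Difference in means divided by the standard error of that difference."""
    se = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    return (x.mean() - y.mean()) / se


t_manual = welch_t(a, b)
t_scipy, _ = stats.ttest_ind(a, b, equal_var=False)
print("manual t = {0:.3f}, scipy t = {1:.3f}".format(t_manual, t_scipy))
```

The two numbers should agree, since `equal_var=False` asks SciPy for Welch's version of the test, which does not assume the two groups share a variance.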

Calculating the t-value in Python is easy:

from scipy import stats

# data.l1 and data.l2 hold the 9 accuracies for each normalization
t_value, p_value = stats.ttest_ind(data.l1, data.l2, equal_var=False)
print("t = {0:.3f}, p = {1:.3f}".format(t_value, p_value))

t = 2.755, p = 0.014

The p-value and the t-value are related. Given the PDF of the t distribution that applies if there is no real difference between the two groups, p is the area under the tails of that curve beyond the observed t-value. In other words, it is the probability of getting a t-value at least as extreme as ours (in either direction, for a two-sided test) by chance alone.
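As a rough check, a two-sided p-value can be recovered from the t-value with the t distribution's survival function. This sketch assumes the Welch-Satterthwaite approximation for the degrees of freedom (the usual choice when `equal_var=False`) and uses the rounded standard deviations from the table, so the degrees of freedom are only approximate.

```python
from scipy import stats

t_value = 2.755            # t-value reported above
n1 = n2 = 9                # runs per normalization
s1, s2 = 0.36, 0.42        # rounded standard deviations from the table

# Welch-Satterthwaite approximation to the degrees of freedom.
v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
dof = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

# Two-sided p-value: area in both tails beyond |t|.
p_value = 2 * stats.t.sf(abs(t_value), dof)
print("dof = {0:.1f}, p = {1:.3f}".format(dof, p_value))
```

The result should land between 0.01 and 0.02, in line with the p-value SciPy reported for the full data.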

"Confidence limits are expressed in terms of a confidence coefficient. Although the choice of confidence coefficient is somewhat arbitrary, in practice 90%, 95%, and 99% intervals are often used, with 95% being the most commonly used." [NIST]

Since p = 0.014, it is unlikely that the two sets of accuracies were drawn from the same population. The difference is significant at the 95% confidence level (p < 0.05), though not at the 99% level (p > 0.01).

[Aside: "As a technical note, a 95% confidence interval does not mean that there is a 95% probability that the interval contains the true mean. The interval computed from a given sample either contains the true mean or it does not. Instead, the level of confidence is associated with the method of calculating the interval. The confidence coefficient is simply the proportion of samples of a given size that may be expected to contain the true mean. That is, for a 95% confidence interval, if many samples are collected and the confidence interval computed, in the long run about 95 % of these intervals would contain the true mean." [NIST]]
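The long-run reading of a confidence interval in the aside can be checked by simulation. This is a minimal sketch: the "true" mean and standard deviation are illustrative values picked to resemble the table, and the samples are assumed to be Gaussian.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_mean, true_sd, n, trials = 94.9, 0.36, 9, 20000  # illustrative values

# Each row is one experiment of n runs.
samples = rng.normal(true_mean, true_sd, size=(trials, n))
means = samples.mean(axis=1)
std_errs = samples.std(axis=1, ddof=1) / np.sqrt(n)

# Half-width of a 95% interval using the t distribution with n - 1 dof.
half_width = stats.t.ppf(0.975, n - 1) * std_errs
covered = np.abs(means - true_mean) <= half_width
coverage = covered.mean()
print("fraction of intervals containing the true mean: {0:.3f}".format(coverage))
```

The fraction should come out close to 0.95. Any single interval either contains the true mean or it doesn't; it is the procedure that is right about 95% of the time.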

The Bayesian Approach

"If you have taken a statistics course, you have probably been taught this technique (although not necessarily learned this technique)... you may have felt uncomfortable with the derivation" [2].

So, let's now refer to Bayesian Methods for Hackers (at github). As priors, I assumed that the true standard deviation was uniformly distributed between 0% and 2%, and that each mean was uniformly distributed between 90% and 99%. Finally, I assumed the values for the accuracy follow a Gaussian distribution.

These assumptions may or may not be correct but they're not outrageous and PyMC should find the optimum answer anyway.
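To show what those assumptions amount to, here is a minimal sketch that swaps PyMC's sampling for a plain NumPy grid approximation of the same model (uniform priors on the mean and standard deviation, Gaussian likelihood). It is not the code behind the histogram below, the accuracy values are hypothetical, and `posterior_of_mean` is a helper name invented for this example.

```python
import numpy as np

# Hypothetical accuracies (%), invented for illustration only.
l1 = np.array([94.6, 95.1, 94.9, 95.3, 94.4, 95.0, 94.8, 95.2, 94.7])
l2 = np.array([94.1, 94.6, 93.9, 94.8, 94.3, 94.0, 94.5, 94.9, 93.8])

# Grids covering the uniform priors: mean in [90, 99], std. dev. in (0, 2].
mu_grid = np.linspace(90.0, 99.0, 901)
sigma_grid = np.linspace(0.01, 2.0, 200)


def posterior_of_mean(data):
    """Posterior over mu_grid, with sigma marginalised out on sigma_grid."""
    mu = mu_grid[:, None, None]
    sigma = sigma_grid[None, :, None]
    # Gaussian log-likelihood summed over observations; flat priors add a constant.
    z = (data[None, None, :] - mu) / sigma
    log_like = (-0.5 * z ** 2 - np.log(sigma)).sum(axis=2)
    joint = np.exp(log_like - log_like.max())
    post = joint.sum(axis=1)
    return post / post.sum()


# Draw from each posterior and look at the difference of the means.
rng = np.random.default_rng(0)
mu1 = rng.choice(mu_grid, size=100000, p=posterior_of_mean(l1))
mu2 = rng.choice(mu_grid, size=100000, p=posterior_of_mean(l2))
diff = mu1 - mu2

print("posterior mean of difference: {0:.2f}%".format(diff.mean()))
print("P(L1 mean > L2 mean): {0:.3f}".format((diff > 0).mean()))
```

A grid is only practical here because each group has just two parameters. For anything larger, a sampler such as the one in PyMC is the more sensible tool.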

The probable difference between the mean accuracies looks like this (code available on my own GitHub repository):
 Histogram of the difference in accuracy between L1 and L2 normalization using the same ANN.
This shows that, given what we know, there is most likely an improvement of about 0.5% in using L1 over L2, a quantity the classical approach did not tell us. There is a possibility that, despite the data, L2 is actually better than L1, but as we can see from the graph this is very unlikely.

[1] Statistics in a Nutshell, O'Reilly
[2] Bayesian Methods for Hackers, Davidson-Pilon