Dr Bristol has tea with statistician Sir Ronald Fisher. She claims she can tell the difference between when the tea is poured first then the milk or milk first then the tea. Sir Ronald doesn't quite believe this and tests her. Sure enough, she gets it right five out of six times. Is this statistically significant?
Sir Ronald "arrived at a method that consists of calculating the total probability of the observed result plus the probability of any more extreme results possible under the null hypothesis (i.e., the probability that she would correctly identify 5 or 6 cups by sheer guessing). This probability is the p-value. If it is less than .05, then Fisher would declare the result significant and reject the null hypothesis of guessing." (from here).
The p-value debate has been rumbling for a while ("for example, a 0.05 p-value does not imply the false positive rate is 5.0%, it can be much higher."). Briefly, if there are 100 000 hypothesis and only 10% are true, of the 90 000 that are not true, 4500 will randomly be within the 5% (p-value = 0.05) range. So, 4500 of the 14 500 hypothesis that we think are true are false - and this is if we correctly identify all of the 10%! That means 31% of our knowledge is rubbish and that's the best case scenario!
Devout Bayesian, Prof Dennis Lindley, in this article discusses why they are inadequate and proposes a Bayesian solution.
Lindley notes that when using p-values, the probabilities change depending on your experiment and not the results. To illustrate:
"If the experiment consisted of 6 pairs of cups being tested and the result was RRRRRW, the relevant probability is .109", that is 6C5 ½6 + 6C6 ½6 and here, R means 'right' and W means 'wrong'.
"If the experiment consisted of pairs being tested until the first error, with the same result, the relevant probability is .031", that is ½5.
This "leads to absurdities because it hinges upon the nonexistent ability to define what other unobserved results would count as “more extreme” than the actual observations. That is, if Fisher had set out to serve Dr. Bristol 6 cups (and only 6 cups) and she is correct 5 times, then we get a p-value of .1, which is not statistically significant. According to Fisher, in this case we should not reject the null hypothesis that Dr. Bristol is guessing. But had he set out to keep giving her additional cups until she was correct 5 times, which incidentally required 6 cups, we get a p-value of .03, which is statistically significant. According to Fisher, we should now reject the null hypothesis. Even though the data observed in both cases are exactly the same, we reach different conclusions because our definition of “more extreme” results (that did not occur) changes depending on which sampling plan we use." (Etz et al)
Bayes defined the posterior probability as:
posterior = K x prior x likelihood
"where K is a number chosen to make the integral of the right-hand side 1." (Lindley)
"Likelihood is a funny concept. It’s not a probability, but it is proportional to a probability ... Since a likelihood isn’t actually a probability it doesn’t obey various rules of probability. For example, likelihood need not sum to 1.
"A critical difference between probability and likelihood is in the interpretation of what is fixed and what can vary. In the case of a conditional probability, P(D|H), the hypothesis is fixed and the data are free to vary. Likelihood, however, is the opposite. The likelihood of a hypothesis, L(H|D), conditions on the data as if they are fixed while allowing the hypotheses to vary.
"Bayes factors are simple extensions of likelihood ratios. A Bayes factor is a weighted average likelihood ratio based on the prior distribution specified for the hypotheses." [from Etz's blog]
Lindley considers the case of wine and tea tasting and people who claim they can tell you about the process with which they were made. The author believes that wine experts really know what they're tasting while tea experts do not.
Lindley encodes these beliefs in the following equations:
Belief that the wine taster can correctly identify wines with probability P:
48(1-P)(P-½) for ½ < P < 1
"This expresses the fact that I think that she can discriminate but can make mistakes. The value 48 makes the total probability 1," he says.
Belief that the tea taster can correctly identify the preparation process with probability P:
l.6(l-P) for P ≥ ½
Note the l.6 is derived from integrating this function between 1 ≥ P ≥ ½ and equating it to 1 (ie, the area under the curve cannot exceed 1.0 as that's the highest a probability can go).
Given 6 trials, the probability that the taster correctly identifies all 6 is P6. The posterior in this case for tea is:
normalizer . prior . likelihood = K (1 - P) P6 for ½ < P < 1
The probability that the taster identifies the first 5 correctly but the last incorrectly is P5(1-P).
"The prior value of this probability was .8, which drops to .59 when 1 error is made in 6 pairs", that is, P at the maximum posterior at RRRRRW is about 0.63.
"...and to .23 with no errors", that is, P at the maximum posterior for RRRRRR is about 0.86. So, the prior for this is 1.6 x (1 - 0.86) = 0.23
[Incidentally, we know this because the posterior ~ P6(1-P). So, if we differentiate to find the maximum:
d (P6(1-P)) = 0
which means P=6/7. This is the probability of the taster having special powers. Plug this in to our equation representing out belief and this becomes l.6(l-6/7) which is the 0.23 Professor Lindley is talking about.]
Naturally, the graph for wine tasting is different as our priors are a quadratic, not a linear equation in P. Their peaks are somewhat closer to 1.0 than our tea tasters (as we'd expect given the prior expresses greater confidence). But it's also worth noting the shape of the graphs. There is greater certainty when we see 6 Rs than with RRRRRW as the curve is sharper.
"It is typically true that the posterior probability of the null hypothesis exceeds the significance level, though there is no logical connection between the two values." (Lindley)
"Lindley’s Bayesian approach depends only on the observed data, so the results are interpretable regardless of whether the sampling plan was rigid or flexible or even known at all. Another key point is that the Bayesian approach is inherently comparative: Hypotheses are tested against one another and never in isolation" (Etz et al).