Recipe
A hypothesis test is only a partial answer to the question "am I being fooled by randomness?" It does not address sampling bias or measurement error, either of which might also be responsible for the apparent effect. [1]
The way to put a hypothesis test together is:
1. Define a test statistic.
2. Define a null hypothesis.
3. Ensure the model can generate synthetic data.
4. Calculate the p-value: generate synthetic data under the null hypothesis (step 3) many times, and count how often the synthetic test statistic is at least as extreme as the observed one.
"All hypothesis tests fit into this framework." [2]
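The four steps above can be sketched as a single function. This is a minimal illustration, not code from Downey's posts: the names `test_stat` and `run_model` and the fair-coin example are my own, chosen to show how any test plugs into the framework.

```python
import random

def p_value(data, test_stat, run_model, iters=1000):
    """Generic recipe: simulate the null hypothesis `iters` times and
    count how often the simulated test statistic is at least as extreme
    as the one observed in the real data."""
    actual = test_stat(data)
    count = sum(1 for _ in range(iters)
                if test_stat(run_model(data)) >= actual)
    return count / iters

# Illustrative use: is a coin that came up heads 140 times in 250 flips fair?
random.seed(1)
observed_heads = 140
stat = lambda heads: abs(heads - 125)                 # distance from the expected 125
model = lambda _: sum(random.random() < 0.5 for _ in range(250))  # flip a fair coin 250 times
pv = p_value(observed_heads, stat, model)
```

Any choice of test statistic and null model fits this shape; only `test_stat` and `run_model` change from test to test.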
p-value
"Ideally we should compute the probability of seeing the effect (E) under both hypotheses; that is P(E | H0) and P(E | HA). But formulating HA is not always easy, so in conventional hypothesis testing, we just compute P(E | H0), which is the p-value" [3]
Given our recipe above, this is how we'd calculate the p-value for a (potentially) crooked die.
1. Run 1000 simulations of a fair die.
2. Compute the chi-squared statistic for each simulated outcome.
3. Count how many of those simulated statistics exceed the chi-squared value computed from the (expected, actual) frequencies of the real rolls.
4. Divide that count by the 1000 trials and you have the p-value.
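Those four steps might look like the following sketch. The tallies in the example call are hypothetical, not data from the post:

```python
import random

def chi_squared(observed, expected):
    """Chi-squared statistic: sum of (O - E)^2 / E over the six faces."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def die_p_value(observed, iters=1000):
    """Simulate a fair die under the null hypothesis and count how often
    the simulated chi-squared meets or exceeds the observed one."""
    n = sum(observed)
    expected = [n / 6] * 6
    actual = chi_squared(observed, expected)
    count = 0
    for _ in range(iters):
        rolls = [random.randint(1, 6) for _ in range(n)]
        freqs = [rolls.count(face) for face in range(1, 7)]
        if chi_squared(freqs, expected) >= actual:
            count += 1
    return count / iters

# Hypothetical tallies for 60 rolls of a suspicious die
random.seed(2)
pv = die_p_value([8, 9, 19, 5, 8, 11])
```

A small p-value here says a fair die rarely produces frequencies this lopsided, which is evidence (though not proof) that the die is crooked.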
Note that "the p-value depends on the choice of test statistic and the model of the null hypothesis, and sometimes these choices determine whether an effect is statistically significant or not." [4] "If different models yield very different results, that's a useful warning that the results are open to interpretation." [2]
Problem: In ThinkStats2, the assertion is made that first babies arrive earlier than subsequent babies. Given pregnancy durations for 4413 first babies and 4735 subsequent births, there is a 0.078-week difference between the two means. Is this significant?
Methodology: we repeatedly resample the births and measure the difference between the two groups' means. Under the null hypothesis, we assume there is no effect and sample from all the data without distinguishing between the two labels: for each of N iterations we shuffle the original data, assign each datum to one of the two groups, and take the difference in the means of those groups. The p-value is the fraction of the N iterations whose difference is at least as large as the observed 0.078 weeks.
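The shuffling procedure is a permutation test, and it can be sketched as below. The synthetic pregnancy lengths in the example are fabricated stand-ins for the NSFG data used in ThinkStats2, so the resulting p-value illustrates the mechanics only:

```python
import random

def perm_diff_means(group1, group2, iters=1000):
    """Permutation test: pool the data, shuffle, split into two groups of
    the original sizes, and count how often the shuffled difference in
    means is at least as large as the observed difference."""
    actual = abs(sum(group1) / len(group1) - sum(group2) / len(group2))
    pooled = list(group1) + list(group2)
    n = len(group1)
    count = 0
    for _ in range(iters):
        random.shuffle(pooled)
        g1, g2 = pooled[:n], pooled[n:]
        if abs(sum(g1) / len(g1) - sum(g2) / len(g2)) >= actual:
            count += 1
    return count / iters

# Stand-in data: two groups of pregnancy lengths (weeks), drawn from the
# same distribution, i.e. the null hypothesis is true by construction
random.seed(3)
firsts = [random.gauss(38.6, 2.7) for _ in range(200)]
others = [random.gauss(38.6, 2.7) for _ in range(220)]
pv = perm_diff_means(firsts, others)
```

Because shuffling erases the first/subsequent labels, the simulated differences show what the 0.078-week gap would look like if pregnancy length had nothing to do with birth order.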
[1] "Statistical inference is only mostly wrong" – Downey
[2] "There is still only one test" – Downey
[3] "There is only one test!" – Downey
[4] Think Stats, 2nd edition – Downey