Recipe
A hypothesis test is only a partial answer to the question "am I being fooled by randomness?" It does not address sampling bias or measurement error, either of which might also be responsible for the apparent effect. [1]
The way to put a hypothesis test together is:
1. Define a test statistic.
2. Define a null hypothesis.
3. Ensure the model can generate synthetic data.
4. Calculate the p-value: generate synthetic data under the null hypothesis (step 3) many times, and count how often the synthetic test statistic is at least as extreme as the observed one.
"All hypothesis tests fit into this framework." [2]
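The four steps above can be sketched as a single function. This is a minimal illustration, not code from Downey's posts: the names `test_stat` and `run_model` and the fair-coin example are my own, chosen to show how any test plugs into the framework.

```python
import random

def p_value(data, test_stat, run_model, iters=1000):
    """Generic recipe: simulate the null hypothesis `iters` times and
    count how often the simulated test statistic is at least as extreme
    as the one observed in the real data."""
    actual = test_stat(data)
    count = sum(1 for _ in range(iters)
                if test_stat(run_model(data)) >= actual)
    return count / iters

# Illustrative use: is a coin that came up heads 140 times in 250 flips fair?
random.seed(1)
observed_heads = 140
stat = lambda heads: abs(heads - 125)                 # distance from the expected 125
model = lambda _: sum(random.random() < 0.5 for _ in range(250))  # flip a fair coin 250 times
pv = p_value(observed_heads, stat, model)
```

Any choice of test statistic and null model fits this shape; only `test_stat` and `run_model` change from test to test.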
p-value
"Ideally we should compute the probability of seeing the effect (E) under both hypotheses; that is P(E | H0) and P(E | HA). But formulating HA is not always easy, so in conventional hypothesis testing, we just compute P(E | H0), which is the p-value" [3]
Given our recipe above, this is how we'd calculate the p-value for a (potentially) crooked die.
1. Run 1000 simulations of a fair die.
2. Compute the chi-squared statistic for each simulated outcome.
3. Count how many of those simulated statistics exceed the chi-squared value computed from the (expected, actual) frequencies of the real rolls.
4. Divide that count by the 1000 trials and you have the p-value.
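Those four steps might look like the following sketch. The tallies in the example call are hypothetical, not data from the post:

```python
import random

def chi_squared(observed, expected):
    """Chi-squared statistic: sum of (O - E)^2 / E over the six faces."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def die_p_value(observed, iters=1000):
    """Simulate a fair die under the null hypothesis and count how often
    the simulated chi-squared meets or exceeds the observed one."""
    n = sum(observed)
    expected = [n / 6] * 6
    actual = chi_squared(observed, expected)
    count = 0
    for _ in range(iters):
        rolls = [random.randint(1, 6) for _ in range(n)]
        freqs = [rolls.count(face) for face in range(1, 7)]
        if chi_squared(freqs, expected) >= actual:
            count += 1
    return count / iters

# Hypothetical tallies for 60 rolls of a suspicious die
random.seed(2)
pv = die_p_value([8, 9, 19, 5, 8, 11])
```

A small p-value here says a fair die rarely produces frequencies this lopsided, which is evidence (though not proof) that the die is crooked.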
Note that "the p-value depends on the choice of test statistic and the model of the null hypothesis, and sometimes these choices determine whether an effect is statistically significant or not." [4] "If different models yield very different results, that's a useful warning that the results are open to interpretation." [2]
Problem: In ThinkStats2, the assertion is made that first babies arrive earlier than subsequent babies. Given pregnancy durations for 4413 first babies and 4735 subsequent births, there is a 0.078-week difference between the two means. Is this significant?
Methodology: we repeatedly resample the births and measure the difference between the two groups' means. Under the null hypothesis, we assume there is no effect and sample from all the data without distinguishing between the two labels: for each of N iterations we shuffle the original data, assign each datum to one of the two groups, and take the difference in the means of those groups. The p-value is the fraction of the N iterations whose difference is at least as large as the observed 0.078 weeks.
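The shuffling procedure is a permutation test, and it can be sketched as below. The synthetic pregnancy lengths in the example are fabricated stand-ins for the NSFG data used in ThinkStats2, so the resulting p-value illustrates the mechanics only:

```python
import random

def perm_diff_means(group1, group2, iters=1000):
    """Permutation test: pool the data, shuffle, split into two groups of
    the original sizes, and count how often the shuffled difference in
    means is at least as large as the observed difference."""
    actual = abs(sum(group1) / len(group1) - sum(group2) / len(group2))
    pooled = list(group1) + list(group2)
    n = len(group1)
    count = 0
    for _ in range(iters):
        random.shuffle(pooled)
        g1, g2 = pooled[:n], pooled[n:]
        if abs(sum(g1) / len(g1) - sum(g2) / len(g2)) >= actual:
            count += 1
    return count / iters

# Stand-in data: two groups of pregnancy lengths (weeks), drawn from the
# same distribution, i.e. the null hypothesis is true by construction
random.seed(3)
firsts = [random.gauss(38.6, 2.7) for _ in range(200)]
others = [random.gauss(38.6, 2.7) for _ in range(220)]
pv = perm_diff_means(firsts, others)
```

Because shuffling erases the first/subsequent labels, the simulated differences show what the 0.078-week gap would look like if pregnancy length had nothing to do with birth order.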
[1] "Statistical inference is only mostly wrong" – Downey
[2] "There is still only one test" – Downey
[3] "There is only one test!" – Downey
[4] Think Stats, 2nd edition – Downey