Hypothesis Testing Blog

Shirsh Verma
10 min read · Mar 23, 2021

Data must be interpreted in order to add meaning.

We can interpret data by assuming a specific structure for our outcome and using statistical methods to confirm or reject that assumption. The assumption is called a hypothesis and the statistical tests used for this purpose are called statistical hypothesis tests.

Whenever we want to make claims about the distribution of data or whether one set of results is different from another set of results in applied machine learning, we must rely on statistical hypothesis tests.

One of the main purposes of statistics is to test a hypothesis. For example, we might run an experiment and find that a certain drug is effective at treating headaches. But if we can’t repeat that experiment, no one will take our results seriously. A good example of this was the cold fusion discovery, which petered into obscurity because no one was able to duplicate the results.

What is a Hypothesis?

A hypothesis is an educated guess about something in the world around us. It should be testable, either by experiment or observation. For example:

  1. A new medicine we think might work.
  2. A way of teaching we think might be better.
  3. A possible location of new species.
  4. A fairer way to administer standardized tests.

In statistics, a hypothesis test calculates some quantity under a given assumption. The result of the test allows us to interpret whether the assumption holds or whether the assumption has been violated.

Two concrete examples that we will use a lot in machine learning are:

  • A test that assumes that data has a normal distribution.
  • A test that assumes that two samples were drawn from the same underlying population distribution.

The assumption of a statistical test is called the null hypothesis, or H0 for short. It is often called the default assumption, or the assumption that nothing has changed.

A violation of the test’s assumption is often called the first hypothesis, the alternate hypothesis, or Ha for short. Ha is really shorthand for “some other hypothesis,” as all we know is that the evidence suggests that H0 can be rejected.

  • Null (H0): The assumption of the test holds; we fail to reject it at some level of significance.
  • Alternate (Ha): The assumption of the test does not hold; we reject it at some level of significance.

Before we can reject or fail to reject the null hypothesis, we must interpret the result of the test.

Statistical Test Interpretation

The results of a statistical hypothesis test must be interpreted for us to start making claims.

This is a point that may cause a lot of confusion for beginners and experienced practitioners alike.

There are two common forms that a result from a statistical hypothesis test may take, and they must be interpreted in different ways. They are the p-value and critical values.

Interpret the p-value

We describe a finding as statistically significant by interpreting the p-value.

For example, we may perform a normality test on a data sample and find no significant evidence that the sample deviates from a Gaussian distribution, in which case we fail to reject the null hypothesis.

A statistical hypothesis test may return a value called p or the p-value. This is a quantity that we can use to interpret or quantify the result of the test and either reject or fail to reject the null hypothesis. This is done by comparing the p-value to a threshold value chosen beforehand called the significance level.

The significance level is often referred to by the Greek lower case letter alpha.

A common value used for alpha is 5% or 0.05. A smaller alpha value, such as 1% or 0.1%, demands stronger evidence before the null hypothesis is rejected.

The p-value is compared to the pre-chosen alpha value. A result is statistically significant when the p-value is less than or equal to alpha. This signifies a change was detected: that the default hypothesis can be rejected.

  • If p-value > alpha: Fail to reject the null hypothesis (i.e. not significant result).
  • If p-value <= alpha: Reject the null hypothesis (i.e. significant result).

For example, if we were performing a test of whether a data sample was normal and we calculated a p-value of .07, we could state something like:

The test did not find sufficient evidence that the data sample deviates from a normal distribution, failing to reject the null hypothesis at a 5% significance level.
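The decision rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Jarque-Bera test is our choice of normality test (the text does not name one), its p-value relies on a large-sample chi-square approximation, the data are simulated, and the names jarque_bera and decide are hypothetical helpers.

```python
import math
import random


def jarque_bera(sample):
    """Return (JB statistic, asymptotic p-value) for H0: the data are normal."""
    n = len(sample)
    mean = sum(sample) / n
    # Central moments with the 1/n convention used by the test.
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    m4 = sum((x - mean) ** 4 for x in sample) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Under H0, JB is approximately chi-square with 2 degrees of freedom,
    # whose upper-tail probability is exp(-x / 2).
    p_value = math.exp(-jb / 2.0)
    return jb, p_value


def decide(p_value, alpha=0.05):
    """Compare the p-value to alpha: reject when p <= alpha."""
    if p_value <= alpha:
        return "reject H0"
    return "fail to reject H0"


rng = random.Random(0)  # fixed seed so the run is repeatable
gaussian_like = [rng.gauss(0.0, 1.0) for _ in range(500)]  # simulated data
skewed = [rng.expovariate(1.0) for _ in range(500)]        # simulated data

for name, data in [("gaussian_like", gaussian_like), ("skewed", skewed)]:
    stat, p = jarque_bera(data)
    print(f"{name}: JB={stat:.2f}, p={p:.4f} -> {decide(p)}")

print(decide(0.07))  # the p-value of .07 from the example in the text
```

A p-value of .07 is above alpha = 0.05, so the helper reports that we fail to reject the null hypothesis, matching the statement above.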

“Reject” vs “Failure to Reject”

The p-value is probabilistic.

This means that when we interpret the result of a statistical test, we do not know what is true or false, only what is likely.

Rejecting the null hypothesis means that there is sufficient statistical evidence that the null hypothesis does not look likely. Otherwise, it means that there is not sufficient statistical evidence to reject the null hypothesis.

We may think about the statistical test in terms of the dichotomy of rejecting and accepting the null hypothesis. The danger is that if we say that we “accept” the null hypothesis, the language suggests that the null hypothesis is true. Instead, it is safer to say that we “fail to reject” the null hypothesis, as in, there is insufficient statistical evidence to reject it.

The phrasing “reject” vs “fail to reject” can be confusing at first. We can think of it as “reject” vs “accept” in our minds, as long as we remind ourselves that the result is probabilistic and that even an “accepted” null hypothesis can still be wrong.

After we have determined which hypothesis the sample supports, we make a decision. There are two options for a decision. They are “reject H0” if the sample information favors the alternative hypothesis or “do not reject H0” or “decline to reject H0” if the sample information is insufficient to reject the null hypothesis.

Mathematical Symbols Used in H0 and Ha:

Note

H0 always has a symbol with an equal in it. Ha never has a symbol with an equal in it. The choice of symbol depends on the wording of the hypothesis test. However, be aware that many researchers (including one of the co-authors in research work) use = in the null hypothesis, even with > or < as the symbol in the alternative hypothesis. This practice is acceptable because we only make the decision to reject or not reject the null hypothesis.

Hypothesis Testing General Framework

Interpret Critical Values

Some tests do not return a p-value.

Instead, they might return a list of critical values and their associated significance levels, as well as a test statistic.

These are usually nonparametric or distribution-free statistical hypothesis tests.

The choice of returning a p-value or a list of critical values is really an implementation choice.

The results are interpreted in a similar way. Instead of comparing a single p-value to a pre-specified significance level, the test statistic is compared to the critical value at a chosen significance level.

  • If test statistic < critical value: Fail to reject the null hypothesis.
  • If test statistic >= critical value: Reject the null hypothesis.

Again, the meaning of the result is similar: the chosen significance level sets a probabilistic threshold for deciding whether to reject or fail to reject the base assumption of the test given the data.
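As a rough sketch of the critical-value comparison, the snippet below uses an upper-tailed z-test. The choice of a z statistic, the helper name upper_tail_decision, and the two test statistics are assumptions made for illustration; a real test would supply its own statistic and critical values.

```python
from statistics import NormalDist


def upper_tail_decision(test_statistic, alpha=0.05):
    """Compare a z statistic with the upper-tail critical value at level alpha."""
    # Critical value: the point with area alpha to its right under H0.
    critical_value = NormalDist().inv_cdf(1.0 - alpha)
    if test_statistic >= critical_value:
        return critical_value, "reject H0"
    return critical_value, "fail to reject H0"


for z in (1.2, 2.1):  # hypothetical test statistics
    cv, decision = upper_tail_decision(z)
    print(f"z={z}: critical value={cv:.3f} -> {decision}")
```

Lowering alpha moves the critical value further into the tail, so the same test statistic needs to be more extreme to reject.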

EXAMPLE

We want to test if college students take less than five years to graduate from college, on average. The null and alternative hypotheses are:

H0: μ ≥ 5

Ha: μ < 5
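To make these hypotheses concrete, here is a minimal sketch of a lower-tailed test. Everything numeric is hypothetical: the sample of 49 graduates, the sample mean of 4.8 years, and the population standard deviation of 0.7 years (assumed known so that a z-test applies). The function name lower_tail_z_test is also just for illustration.

```python
import math
from statistics import NormalDist


def lower_tail_z_test(sample_mean, mu0, sigma, n):
    """Return (z, p-value) for H0: mu >= mu0 against Ha: mu < mu0."""
    standard_error = sigma / math.sqrt(n)
    z = (sample_mean - mu0) / standard_error
    p_value = NormalDist().cdf(z)  # area to the left of z
    return z, p_value


# Hypothetical survey: 49 graduates, mean 4.8 years, assumed sigma of 0.7 years.
z, p = lower_tail_z_test(sample_mean=4.8, mu0=5.0, sigma=0.7, n=49)
alpha = 0.05
print(f"z={z:.2f}, p={p:.4f}")
print("reject H0" if p <= alpha else "fail to reject H0")
```

Because Ha points to the left (μ < 5), only small sample means count as evidence against H0, which is why the p-value is the area to the left of z.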

EXAMPLE

In an issue of U.S. News and World Report, an article on school standards stated that about half of all students in France, Germany, and Israel take advanced placement exams and a third pass. The same article stated that 6.6% of U.S. students take advanced placement exams and 4.4% pass. Test if the percentage of U.S. students who take advanced placement exams is more than 6.6%. State the null and alternative hypotheses.

H0: p ≤ 0.066

Ha: p > 0.066
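A one-proportion z-test is one way such a claim could be checked. The sketch below assumes a hypothetical sample of 1,000 U.S. students of whom 80 took an advanced placement exam; neither number comes from the article, and upper_tail_proportion_test is an illustrative name.

```python
import math
from statistics import NormalDist


def upper_tail_proportion_test(successes, n, p0):
    """Return (z, p-value) for H0: p <= p0 against Ha: p > p0."""
    p_hat = successes / n
    # Standard error is computed at the boundary value p0 of H0.
    standard_error = math.sqrt(p0 * (1.0 - p0) / n)
    z = (p_hat - p0) / standard_error
    p_value = 1.0 - NormalDist().cdf(z)  # area to the right of z
    return z, p_value


# Hypothetical sample: 80 of 1,000 students took an AP exam.
z, p_value = upper_tail_proportion_test(successes=80, n=1000, p0=0.066)
alpha = 0.05
print(f"z={z:.2f}, p={p_value:.4f}")
print("reject H0" if p_value <= alpha else "fail to reject H0")
```

The normal approximation used here is reasonable only when the sample is large enough that both n·p0 and n·(1 − p0) are comfortably above about 10.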

Errors in Statistical Tests

The interpretation of a statistical hypothesis test is probabilistic.

That means that the evidence of the test may suggest an outcome and be mistaken.

For example, if alpha was 5%, it suggests that (at most) 1 time in 20 a true null hypothesis would be mistakenly rejected because of the statistical noise in the data sample.

A small p-value (reject the null hypothesis) means either that the null hypothesis is false (we got it right) or that it is true and some rare and unlikely event has been observed (we made a mistake). If this type of error is made, it is called a false positive. We falsely reject the null hypothesis.

Alternately, given a large p-value (fail to reject the null hypothesis), it may mean that the null hypothesis is true (we got it right) or that the null hypothesis is false and some unlikely event occurred (we made a mistake). If this type of error is made, it is called a false negative. We falsely believe the null hypothesis or assumption of the statistical test.

Each of these two types of error has a specific name.

  • Type I Error: The incorrect rejection of a true null hypothesis or a false positive.
  • Type II Error: The incorrect failure to reject a false null hypothesis or a false negative.

All statistical hypothesis tests have a chance of making either of these types of errors. False findings or false discoveries are more than possible; they are probable.

Ideally, we want to choose a significance level that minimizes the likelihood of one of these errors, for example a very small significance level to make false positives rare. Although significance levels such as 0.05 and 0.01 are common in many fields of science, harder sciences, such as physics, are more aggressive.

In physics, it is common to use a significance level of 3 * 10^-7 or 0.0000003, often referred to as 5-sigma. This means that, if the null hypothesis were true, a result at least this extreme would arise by chance about once in 3.5 million independent repeats of the experiment. Using a threshold like this may require a much larger data sample.
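The 5-sigma figure can be checked with a short calculation. The sketch assumes the one-sided convention, that is, the area of a standard normal distribution more than five standard deviations above the mean.

```python
import math

# Upper-tail area of a standard normal beyond 5 standard deviations.
# erfc is used instead of 1 - cdf to keep precision in the far tail.
p_five_sigma = 0.5 * math.erfc(5.0 / math.sqrt(2.0))

print(f"p = {p_five_sigma:.3e}")
print(f"about 1 in {1.0 / p_five_sigma:,.0f}")
```

The result is close to 3 * 10^-7, or roughly 1 in 3.5 million, in line with the numbers quoted above.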

Nevertheless, these types of errors are always present and must be kept in mind when presenting and interpreting the results of statistical tests. It is also a reason why it is important to have findings independently verified.

Connection between Type I error and significance level:

A significance level α corresponds to a certain value of the test statistic, say tα, represented by the orange line in the picture of a sampling distribution below (the picture illustrates a hypothesis test with alternate hypothesis “µ > 0”).

Since the shaded area indicated by the arrow is the p-value corresponding to tα, that p-value (shaded area) is α. To have a p-value less than α, the t-value for this test must be to the right of tα. So the probability of rejecting the null hypothesis when it is true is the probability that t > tα, which we saw above is α. In other words, the probability of Type I error is α.
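A small simulation can make this connection tangible. The sketch below repeatedly draws samples for which the null hypothesis “µ = 0” really is true and counts how often an upper-tailed test rejects it. As simplifying assumptions, it uses a z statistic with a known standard deviation of 1 rather than the t statistic in the text, and the sample size, number of trials, and seed are arbitrary.

```python
import math
import random
from statistics import NormalDist, fmean


def type_i_error_rate(alpha=0.05, n=30, trials=4000, seed=1):
    """Fraction of trials in which a true H0 (mu = 0) is rejected."""
    rng = random.Random(seed)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)  # cut-off for Ha: mu > 0
    rejections = 0
    for _ in range(trials):
        # The data really come from the null distribution: mean 0, sigma 1.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = fmean(sample) / (1.0 / math.sqrt(n))
        if z > z_alpha:
            rejections += 1  # a false positive, since H0 is true
    return rejections / trials


print(f"observed Type I error rate: {type_i_error_rate():.3f}")
```

With enough trials the observed rejection rate should settle near alpha, which is the sense in which alpha is the probability of a Type I error.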

Pros and Cons of Setting a Significance Level:

  • Setting a significance level (before doing inference) has the advantage that the analyst is not tempted to choose a cut-off on the basis of what he or she hopes is true.
  • It has the disadvantage that it neglects that some p-values might best be considered borderline. This is one reason why it is important to report p-values when reporting results of hypothesis tests. It is also good practice to include confidence intervals corresponding to the hypothesis test. (For example, if a hypothesis test for the difference of two means is performed, also give a confidence interval for the difference of those means. If the significance level for the hypothesis test is .05, then use confidence level 95% for the confidence interval.)
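As an illustration of reporting a p-value together with its matching confidence interval, the following sketch tests the difference of two means. It assumes a two-sided large-sample z approximation, and the summary statistics for the two groups are invented for the example; diff_means_test_and_ci is a hypothetical helper name.

```python
import math
from statistics import NormalDist


def diff_means_test_and_ci(mean_a, sd_a, n_a, mean_b, sd_b, n_b, alpha=0.05):
    """Two-sided large-sample z-test and matching CI for mean_a - mean_b."""
    diff = mean_a - mean_b
    standard_error = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    z = diff / standard_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    # A (1 - alpha) confidence interval pairs with a test at level alpha.
    margin = NormalDist().inv_cdf(1.0 - alpha / 2.0) * standard_error
    return p_value, (diff - margin, diff + margin)


# Hypothetical summary statistics for two groups.
p, (low, high) = diff_means_test_and_ci(10.4, 2.0, 100, 9.8, 2.2, 120)
print(f"p={p:.4f}, 95% CI=({low:.3f}, {high:.3f})")
```

For this kind of test, the 95% interval excludes zero exactly when the two-sided p-value falls below .05, so the two pieces of the report tell a consistent story while the interval also shows the size of the effect.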

The following table summarizes Type I and Type II errors:

The following diagram illustrates the Type I error and the Type II error against the specific alternate hypothesis “µ = 1” in a hypothesis test for a population mean µ, with null hypothesis “µ = 0”, alternate hypothesis “µ > 0”, and significance level α = 0.05.

  • The blue (leftmost) curve is the sampling distribution assuming the null hypothesis “µ = 0”.
  • The green (rightmost) curve is the sampling distribution assuming the specific alternate hypothesis “µ = 1”.
  • The vertical red line shows the cut-off for rejection of the null hypothesis: the null hypothesis is rejected for values of the test statistic to the right of the red line (and not rejected for values to the left of the red line).
  • The area of the diagonally hatched region to the right of the red line and under the blue curve is the probability of Type I error (α).
  • The area of the horizontally hatched region to the left of the red line and under the green curve is the probability of Type II error (β).
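The two hatched areas can also be computed directly. The sketch below assumes normal sampling distributions and a population standard deviation of 1, and tries a few hypothetical sample sizes; the diagram does not state either quantity, so the resulting numbers are illustrative only.

```python
import math
from statistics import NormalDist


def type_ii_error(mu0, mu1, sigma, n, alpha=0.05):
    """Beta for H0: mu = mu0 vs Ha: mu > mu0 when the true mean is mu1."""
    standard_error = sigma / math.sqrt(n)
    # Cut-off on the sample-mean scale (the vertical red line).
    cutoff = mu0 + NormalDist().inv_cdf(1.0 - alpha) * standard_error
    # Probability that the sample mean falls left of the cut-off under mu1.
    return NormalDist(mu1, standard_error).cdf(cutoff)


for n in (1, 4, 9):  # hypothetical sample sizes, sigma assumed to be 1
    beta = type_ii_error(mu0=0.0, mu1=1.0, sigma=1.0, n=n)
    print(f"n={n}: beta={beta:.3f}, power={1.0 - beta:.3f}")
```

With α held at 0.05, a larger sample narrows both curves, which shrinks β and raises the power of the test (1 − β).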
