7. Testing the Difference Between Variances

Overview

In addition to testing the difference between two means, we may want to test the difference between two variances. This lesson demonstrates a hypothesis test that can be used for this purpose.

To explore more Kubicle data literacy subjects, please refer to our full library.

Summary

  1. Lesson Goal (00:13)

    The goal of this lesson is to conduct a hypothesis test for the difference between two variances.

  2. Overview of the Problem (00:23)

    Testing the difference between two variances is similar in principle to testing the difference between two means. In this lesson, we consider the same problem as the previous lesson, where we have sample data on commuting times in two cities. Our aim is to test the hypothesis that the population variance in commuting times is the same for both cities.

  3. Assumptions of the Test (01:04)

    When testing the difference between two variances, we make two assumptions, which are the same assumptions we would make when testing the difference between two means. First, we assume that the two samples are independent simple random samples. Second, we assume the data for both populations follows a normal distribution.

     

We can check the normality assumption in several ways, such as with a Q-Q plot. In our data, one city's sample shows a few low outliers. Because the dataset is small, we treat these as outliers rather than as evidence that the normality assumption is violated, though this warrants some caution when interpreting the results.

  4. Null and Alternative Hypotheses (02:16)

When testing the difference between variances, the null hypothesis is usually that the two population variances are equal. We state the hypothesis in terms of variances, but if the variances are equal then the standard deviations will also be equal. Equivalently, we can say that one variance divided by the other equals one. We use the ratio rather than the difference because the ratio of two variances is usually a more intuitive quantity than the difference between them.

     

    We can select a one-sided or two-sided alternative hypothesis. In our case, we select a two-sided alternative, where the variances are not equal, or the ratio of the variances is not equal to one.
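In standard notation, the null and two-sided alternative hypotheses described above can be sketched as:

```latex
% Null and alternative hypotheses for the two-variance test
H_0:\ \sigma_1^2 = \sigma_2^2 \quad\Leftrightarrow\quad \frac{\sigma_1^2}{\sigma_2^2} = 1
\qquad
H_1:\ \sigma_1^2 \neq \sigma_2^2 \quad\Leftrightarrow\quad \frac{\sigma_1^2}{\sigma_2^2} \neq 1
```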

  5. Calculating the Test Statistic (03:11)

The test statistic is an F-statistic, and is easy to calculate: it is the ratio of the two sample variances. Note that some sources recommend associating the larger variance with the first sample and the smaller variance with the second. This makes it easier to calculate the p-value from a statistical table; however, it makes no difference if you determine the p-value using software or an online calculator.

     

    The test statistic follows an F-distribution, which has different degrees of freedom for the numerator and the denominator. The numerator degrees of freedom are the number of observations from the first sample minus one. The denominator degrees of freedom are the number of observations from the second sample minus one.
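As a concrete sketch of this calculation, here is the F-statistic and both degrees of freedom for the lesson's commuting-time samples (standard deviations 3.4 and 8.2, sample sizes 13 and 18), using only the Python standard library:

```python
# F-statistic for comparing two variances: the ratio of the sample variances.
s1, n1 = 3.4, 13   # Royal Woods: sample standard deviation, sample size
s2, n2 = 8.2, 18   # Great Lake City: sample standard deviation, sample size

f_stat = s1**2 / s2**2   # variance of sample 1 over variance of sample 2
df_num = n1 - 1          # numerator degrees of freedom
df_den = n2 - 1          # denominator degrees of freedom

print(round(f_stat, 4), df_num, df_den)  # 0.1719 12 17
```

These match the values used in the transcript below: F = 0.1719 with 12 and 17 degrees of freedom.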

  6. Calculating the P-Value (04:26)

    After calculating the test statistic, we can calculate the p-value. The method used for this varies based on the alternative hypothesis. For a one-sided hypothesis featuring a less than condition, we want to find the area to the left of the test statistic. For a one-sided hypothesis featuring a greater than condition, we want to find the area to the right of the test statistic. For a two-sided hypothesis, we need to find both areas, then double the smaller area to get the p-value.

     

    As with a t-distribution, the best way to find areas from the F-distribution is using software or an online calculator. The calculator we use in this lesson can be found here.
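The tail rules above can be sketched as a small helper function; here we apply it to the left-tail area of 0.0018 reported later in the lesson (a minimal sketch; the function name is illustrative):

```python
def two_variance_p_value(left_area, alternative="two-sided"):
    """p-value for an F-test, given the area to the left of the F-statistic."""
    if alternative == "less":
        return left_area                      # lower-tail test
    if alternative == "greater":
        return 1 - left_area                  # upper-tail test
    return 2 * min(left_area, 1 - left_area)  # two-sided: double the smaller tail

# Left-tail area for F = 0.1719 with (12, 17) degrees of freedom, from the lesson
print(two_variance_p_value(0.0018))  # 0.0036
```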

  7. Formulating the Test Conclusions (06:15)

    Once we generate the p-value, we can formulate the conclusions in the usual way. If we reject the null hypothesis, we conclude that there is some difference between the variances for the two populations. If we do not reject the null hypothesis, then we conclude that there is not enough evidence of any difference in variances between the two populations.

Transcript

In the previous lesson, we learned how to conduct a hypothesis test for the difference between two means. We'll look at a similar concept in this lesson.

Our goal is to conduct a hypothesis test for the difference between two variances.

We'll consider the same problem we looked at in the last lesson.

We looked at commuting times for two samples of people from different cities. In Royal Woods, the sample size was 13 people and the sample standard deviation of commute times was 3.4 minutes.

In Great Lake City, the sample size was 18 and the sample standard deviation was 8.2 minutes.

Note that we don't need to know the mean commute time for either city in order to test the variances, so we've left that information out. Based on our sample data, it looks like the population standard deviations, and therefore the variances, are likely to be different between the two cities, but we'd like to use a hypothesis test to confirm this.

In order to test the difference between the variances, we need to make two assumptions. First, we assume the two samples are independent simple random samples. In our case, this is true, as the same people are not included in both samples.

Second, we assume the commute times for the two populations follow a normal distribution. You may notice that these are the same assumptions we made when testing the difference between the means in the previous lesson.

Let's quickly check the normality assumption. Here, we can see Q-Q plots for both cities. Remember, we want the points to follow a straight line in order to show normality. In Royal Woods, the data appears to follow a straight line quite closely, while in Great Lake City, the picture is less straightforward. Specifically, there appear to be two or possibly three low outliers at the bottom of the chart.

We have a fairly small sample size, so these could just be outliers or they could indicate the data is not actually normal. We'll assume these are just outliers, but we may need to be more cautious about our conclusions when we complete the test.

Let's now create the null and alternative hypotheses.

The null hypothesis is that the population variances are equal. We use variances in the hypothesis, but if the variances are equal, then the standard deviations will also be equal.

If the variances are equal, the ratio of the variances equals one. We write the formula this way because the value of a variance is not always an intuitive quantity, but the ratio of two variances is somewhat more intuitive. For the alternative hypothesis, we can choose a one-sided or two-sided hypothesis as usual. In our case, we're interested in any difference between the variances.

Therefore we'll select a two-sided alternative where the two variances are not equal or the ratio of the two is not one.

Finally, we'll pick a significance level, which will be 0.05.

Next, we'll calculate the test statistic.

This statistic will be an F statistic, and as we can see, it has a simple formula: we simply divide one sample variance by the other.

In our case, this produces a value of 0.1719.

Note that in this instance, the variance of city two was larger than the variance of city one. When testing variances, some sources recommend that the sample with the larger variance should always be sample one, as this can make it easier to calculate the P value if you use a statistical table. However, we'll be using an online calculator to determine our P value, and in this case, it makes no difference which sample is sample one and which is sample two.

The test statistic follows an F distribution. You may remember that the F distribution has separate degrees of freedom for the numerator and denominator. The degrees of freedom for the numerator are n one minus one, or 12, and the degrees of freedom for the denominator are n two minus one, or 17.

Here we see an F distribution with 12 and 17 degrees of freedom.

Our test statistic has a value of 0.1719. Let's discuss how we calculate the P value from this distribution, as it varies based on the alternative hypothesis.

If the alternative hypothesis is that variance one is less than variance two, we want to find the area to the left of 0.1719. If the alternative hypothesis is that variance one is greater than variance two, we want to find the area to the right of 0.1719.

If our alternative hypothesis is two-sided, we want to find both areas and the P value will be two times the smaller area.

Our alternative hypothesis is two-sided, so we want to find both areas. As with the T distribution, we'll use an online calculator to do this.

F tables do exist, but like with a T distribution, they only provide a range of possible P values, not a precise value.
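If you prefer code to an online calculator, the F-distribution's CDF gives the same left-tail area. A sketch assuming the third-party scipy package is installed:

```python
from scipy.stats import f  # F-distribution (requires the scipy package)

# Left-tail area: P(F <= 0.1719) with 12 and 17 degrees of freedom
left_area = f.cdf(0.1719, dfn=12, dfd=17)

# Two-sided p-value: double the smaller of the two tail areas
p_value = 2 * min(left_area, 1 - left_area)

print(round(left_area, 4), round(p_value, 4))  # 0.0018 0.0036
```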

Here we see an online calculator for the F distribution. V1 is the numerator degrees of freedom, which is 12.

V2 is the denominator degrees of freedom, which is 17.

Our X value is 0.1719 and we'll select the less than probability option.

This tells us that the area to the left of 0.1719 is 0.0018 rounded to four decimal places.

Returning to our distribution, we can say that the area to the left of 0.1719 is 0.0018.

This means that the area to the right is 0.9982.

For our two-sided test, the P value is two times the smaller area, which is two times 0.0018, or 0.0036.

Let's now formulate our conclusions.

The null hypothesis was that the variances in commuting times for our two cities were equal.

The alternative hypothesis was that the variances were not equal.

The P value is 0.0036, which is considerably lower than the significance level of 0.05. As a result, we can reject the null hypothesis and conclude that there is a difference in the variance of commute times between the two cities. In fact, the P value is so small that we can be confident in our conclusions, even allowing for the possible normality issue we saw in one of the datasets. Let's stop the lesson here.

In the final two lessons of this course, we'll learn about analysis of variance or ANOVA, which lets us compare the means of more than two populations. We'll start by learning how to calculate the sum of squares in the next lesson.