Principles of Data Science

4.4 Analysis of Variance (ANOVA)

Learning Outcomes

By the end of this section, you should be able to:

  • 4.4.1 Set up and apply a one-way analysis of variance hypothesis test.
  • 4.4.2 Use Python to conduct one-way analysis of variance.

In Testing Claims Based on One Sample, we reviewed methods to conduct hypothesis testing for two means; however, there are scenarios where a researcher might want to compare three or more means to determine whether any of them differ from one another. For example, a medical researcher might want to compare the average time that three different pain relief medications take to provide relief from a migraine headache. Analysis of variance (ANOVA) is a statistical method that allows a researcher to compare three or more means and determine whether the means are all statistically the same or whether at least one mean is different from the others. The one-way ANOVA focuses on one independent variable. For two independent variables, a method called “two-way ANOVA” is applicable, but this method is beyond the scope of this text. We explore the one-way ANOVA next.

One-Way ANOVA

When comparing three or more means, the sample sizes for the groups do not need to be identical. For one-way ANOVA hypothesis testing, we will follow the same general outline of steps for hypothesis testing that was discussed in Hypothesis Testing.

The first step will be to write the null and alternative hypotheses. For one-way ANOVA, the null and alternative hypotheses are always stated as the following:

$H_0$: All population means are equal: $\mu_1 = \mu_2 = \mu_3 = \cdots$

$H_a$: At least one population mean is different from the others.

Here are the requirements to use this procedure:

  1. The samples are random and selected from approximately normal distributions.
  2. The samples are independent of one another.
  3. The population variances are approximately equal.

In this discussion, we assume that the population variances are approximately equal.

When these requirements are met, the F-distribution is used as the basis for conducting the hypothesis test.

The F-distribution is a skewed distribution (skewed to the right), and the shape of the distribution depends on two different degrees of freedom, referred to as degrees of freedom for the numerator and degrees of freedom for the denominator.

Figure 4.10 shows the typical shape of the F-distribution:

A graph showing a nonsymmetrical F distribution curve. The horizontal axis extends from 0 to 4.5 and the vertical axis ranges from 0 to 1. The curve is skewed to the right.
Figure 4.10 Shape of the F-Distribution
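
As a rough illustration of the right skew, the sketch below uses scipy.stats.f to compare the mean and median of one F-distribution. The degrees of freedom (4 and 20) are arbitrary values chosen for illustration and do not come from the text.

```python
from scipy.stats import f

# Hypothetical degrees of freedom, chosen only for illustration
dfn, dfd = 4, 20

# "Frozen" F-distribution with these numerator/denominator degrees of freedom
f_dist = f(dfn, dfd)

mean = f_dist.mean()      # equals dfd / (dfd - 2) whenever dfd > 2
median = f_dist.median()

# In a right-skewed distribution the long right tail pulls the mean above the median
print(f"mean = {mean:.3f}, median = {median:.3f}")
print("mean > median:", mean > median)
```

Trying other values of dfn and dfd shows how the shape of the curve changes with the two degrees of freedom.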

The test statistic for this hypothesis test will be the ratio of two variances, namely the ratio of the variance between samples to the variance within the samples.

$$\text{Test Statistic} = \frac{\text{Variance between samples}}{\text{Variance within samples}}$$

The numerator of the test statistic (variance between samples) is sometimes referred to as variation due to treatment or explained variation.

The denominator of the test statistic (variance within samples) is sometimes referred to as variation due to error, or unexplained variation.

The details for a manual calculation of the test statistic are provided next; however, it is very common to use software for ANOVA.

For the purposes of this discussion, we will assume we are comparing three means, but keep in mind the ANOVA method can be applied to three or more means.

  1. Find the mean for each of the three samples. Label these means as $\bar{x}_1, \bar{x}_2, \bar{x}_3$. Find the variance for each of the three samples. Label these variances as $s_1^2, s_2^2, s_3^2$.
  2. Find the grand mean (label this as $\bar{\bar{x}}$). This is the sum of all the data values from the three samples, divided by the overall sample size.
  3. Calculate the quantity called “sum of squares between (SSB)” according to the following formula:
    $$SSB = n_1(\bar{x}_1 - \bar{\bar{x}})^2 + n_2(\bar{x}_2 - \bar{\bar{x}})^2 + n_3(\bar{x}_3 - \bar{\bar{x}})^2$$
  4. Calculate the quantity called “sum of squares within (SSW)” according to the following formula:
    $$SSW = (n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + (n_3 - 1)s_3^2$$
  5. Calculate the degrees of freedom for the numerator ($df_n$) and degrees of freedom for the denominator ($df_d$).
    For degrees of freedom for the numerator, this is calculated as the number of groups minus 1. For example, if a medical researcher is comparing three different pain relief medications to compare the average time to provide relief from a migraine headache, there are three groups, and thus the degrees of freedom for the numerator would be $3 - 1 = 2$.
    For degrees of freedom for the denominator, this is calculated as the total of the sample sizes minus the number of groups.
  6. Calculate the quantity called “mean square between (MSB)” according to the following formula:
    $$MSB = \frac{SSB}{df_n}$$
  7. Calculate the quantity called “mean square within (MSW)” according to the following formula:
    $$MSW = \frac{SSW}{df_d}$$
  8. Calculate the test statistic ($F$), according to the following formula:
    $$F = \frac{MSB}{MSW}$$
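
The eight steps above can be sketched in Python with NumPy. This is a minimal illustration, not the text's own implementation: the function name one_way_anova_by_hand and the three small samples are hypothetical, chosen so the arithmetic is easy to check by hand.

```python
import numpy as np

def one_way_anova_by_hand(*samples):
    """Follow the manual steps and return the intermediate quantities (hypothetical helper)."""
    groups = [np.asarray(s, dtype=float) for s in samples]
    sizes = [len(g) for g in groups]

    # Step 1: sample mean and sample variance (ddof=1) for each group
    means = [g.mean() for g in groups]
    variances = [g.var(ddof=1) for g in groups]

    # Step 2: grand mean = sum of all values / overall sample size
    grand_mean = np.concatenate(groups).mean()

    # Step 3: sum of squares between
    ssb = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))

    # Step 4: sum of squares within
    ssw = sum((n - 1) * v for n, v in zip(sizes, variances))

    # Step 5: degrees of freedom for numerator and denominator
    dfn = len(groups) - 1
    dfd = sum(sizes) - len(groups)

    # Steps 6-8: mean squares and the F test statistic
    msb = ssb / dfn
    msw = ssw / dfd
    return {"SSB": ssb, "SSW": ssw, "dfn": dfn, "dfd": dfd,
            "MSB": msb, "MSW": msw, "F": msb / msw}

# Made-up data for illustration only: group means are 2, 3, and 6
result = one_way_anova_by_hand([1, 2, 3], [2, 3, 4], [5, 6, 7])
print(f"SSB = {result['SSB']:.3f}, SSW = {result['SSW']:.3f}, F = {result['F']:.3f}")
# prints: SSB = 26.000, SSW = 6.000, F = 13.000
```

For these made-up samples each group has variance 1, so SSW is 2 + 2 + 2 = 6 and MSW is 6/6 = 1, while SSB is 26 and MSB is 26/2 = 13, giving F = 13.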

Once the test statistic is obtained, the p-value is calculated as the area to the right of the test statistic under the $F$-distribution curve (note that the ANOVA hypothesis test is always considered a “right-tail” test).
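
One way to compute this right-tail area is the survival function of scipy.stats.f. In the sketch below, the test statistic and degrees of freedom are hypothetical values picked for illustration, not results from the text.

```python
from scipy.stats import f

# Hypothetical values, for illustration only
F_stat = 3.0   # test statistic
dfn = 2        # degrees of freedom for the numerator
dfd = 12       # degrees of freedom for the denominator

# sf is the survival function, 1 - cdf: the area to the right of F_stat
p_value = f.sf(F_stat, dfn, dfd)
print(f"p-value = {p_value:.4f}")
```

Using sf rather than computing 1 - f.cdf(...) directly is a small design choice: it tends to be more accurate when the tail area is very small.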

Usually, the results of these computations are organized in an ANOVA summary table like Table 4.17.

Variation | Sum of Squares | Degrees of Freedom | Mean Squares | F Test Statistic | p-value
Between | SSB | $df_n$ | MSB | $F = \frac{MSB}{MSW}$ | Area to the right of the F test statistic
Within | SSW | $df_d$ | MSW | N/A |
Table 4.17 ANOVA Summary Table

Using Python for One-Way ANOVA

As mentioned earlier, due to the complexity of these calculations, technology is typically used to calculate the test statistic and p-value.

In Python, the f_oneway() function is provided in the scipy.stats module of the SciPy library, and this function provides both the test statistic and p-value for a one-way ANOVA hypothesis test.

The syntax is to call the function with the data arrays as arguments, as in:

f_oneway(Array1, Array2, Array3, …)

The function returns both the F test statistic and the p-value for the one-way ANOVA hypothesis test.
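
A minimal sketch of the call, including the import it needs, might look like the following. The three groups are made-up values used only to show the syntax; note that the groups do not need to be the same size.

```python
from scipy.stats import f_oneway

# Hypothetical samples, for illustration only (unequal sizes are allowed)
group1 = [4, 5, 6]
group2 = [6, 7, 8, 7]
group3 = [9, 10, 11]

result = f_oneway(group1, group2, group3)

# The result exposes the F test statistic and the p-value as attributes
print("F test statistic:", result.statistic)
print("p-value:", result.pvalue)

# It can also be unpacked like a tuple
f_stat, p_value = result
```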

The details of this Python function are shown in Example 4.25.

Here is a step-by-step example to illustrate this ANOVA hypothesis testing process.

Example 4.25

Problem

An airline manager claims that the average arrival delays for flights from Boston to Dallas are the same for three airlines: Airlines A, B, and C. The following arrival delay data (in minutes) is collected on samples of flights for the three airlines (Table 4.18):

Airline A | Airline B | Airline C
8 | 13 | 11
12 | 0 | 15
7 | 2 | 9
9 | 7 | 11
11 | 6 | 12
4 | 5 | N/A
Table 4.18 Arrival Delay Data (in Minutes) for Three Airlines

Use this sample data to test the claim that the average arrival delays are the same for three airlines: Airlines A, B, and C. Use a level of significance of 0.05 for this analysis. Assume the samples are independent and taken from approximately normal distributions and each population has approximately the same variance.
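
As a sketch of how this test could be run in Python, the code below passes the Table 4.18 data to f_oneway() and compares the p-value to the 0.05 level of significance. The variable names are illustrative choices, and this is one possible setup rather than the text's own worked solution.

```python
from scipy.stats import f_oneway

# Arrival delays (in minutes) from Table 4.18
airline_a = [8, 12, 7, 9, 11, 4]
airline_b = [13, 0, 2, 7, 6, 5]
airline_c = [11, 15, 9, 11, 12]   # Airline C has five observations

alpha = 0.05  # level of significance stated in the problem

f_stat, p_value = f_oneway(airline_a, airline_b, airline_c)
print(f"F test statistic = {f_stat:.3f}")
print(f"p-value = {p_value:.3f}")

# Decision rule: reject the null hypothesis when the p-value is below alpha
if p_value < alpha:
    print("Reject H0: at least one mean arrival delay differs.")
else:
    print("Fail to reject H0: not enough evidence that the mean delays differ.")
```

With these data, the F test statistic works out to roughly 4.39 and the p-value to roughly 0.033. Because the p-value is below 0.05, the decision under this sketch is to reject the null hypothesis that the three airlines have the same average arrival delay.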

© Dec 19, 2024 OpenStax. Textbook content produced by OpenStax is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License . The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, and OpenStax CNX logo are not subject to the Creative Commons license and may not be reproduced without the prior and express written consent of Rice University.