Principles of Data Science

4.2 Hypothesis Testing


Learning Outcomes

By the end of this section, you should be able to:

  • 4.2.1 Apply hypothesis testing methods to test statistical claims involving one sample.
  • 4.2.2 Use Python to assist with hypothesis testing calculations.
  • 4.2.3 Conduct hypothesis tests to compare two means, two proportions, and matched pairs data.

Constructing confidence intervals is one method used to estimate population parameters. Another important statistical method to test claims regarding population parameters is called hypothesis testing.

Hypothesis testing is an important tool for data scientists in that it is used to draw conclusions using sample data, and it is also used to quantify uncertainty associated with these conclusions.

To illustrate the use of hypothesis testing in a data science application, consider an online retailer seeking to increase revenue through a text messaging ad campaign. The retailer wants to examine the hypothesis that the campaign leads to an increase in revenue. To test the hypothesis, the retailer sends targeted text messages to one group of customers but not to another group. The retailer then collects purchase data for customers who received the text message ads and for customers who did not. Using hypothesis testing methods, the retailer can come to a conclusion about the effectiveness of the text messaging campaign.

Hypothesis testing involves setting up a null hypothesis and an alternative hypothesis based on a claim to be tested and then collecting sample data. A null hypothesis represents a statement of no effect or no change in the population; it states that the value of a population parameter is equal to some specified value. An alternative hypothesis is a complementary statement to the null hypothesis; it states that the population parameter has a value that differs from the value given in the null hypothesis.

Hypothesis testing allows a data scientist to assess the likelihood of observing results under the assumption of the null hypothesis and make judgements about the strength of the evidence against the null hypothesis. It provides the foundation for various data science investigations and plays a role at different stages of the analyses such as data collection and validation, modeling related tasks, and determination of statistical significance.

For example, a local restaurant might claim that the average delivery time for food orders is 30 minutes or less. To test the claim, food orders can be placed and corresponding delivery times recorded. If the sample data appears to be inconsistent with the null hypothesis, then the decision is to reject the null hypothesis.

Testing Claims Based on One Sample

For our discussion, we will examine hypothesis testing methods for claims involving means or proportions. (Hypothesis testing can also be performed for standard deviation or variance, though that is beyond the scope of this text.)

The steps for hypothesis testing are as follows:

  1. Set up a null and alternative hypothesis based on the claim. Identify whether the null or the alternative hypothesis represents the claim.
  2. Collect relevant sample data to investigate the claim.
  3. Determine the correct distribution to be used to analyze the hypothesis test (e.g., normal distribution or t-distribution).
  4. Analyze the sample data to determine if the sample data is consistent with the null hypothesis. This will involve statistical calculations involving what is called a test statistic and a p-value (discussed in the next section).
  5. If the sample data is inconsistent with the null hypothesis, then the decision is to “reject the null hypothesis.” If the sample data is consistent with the null hypothesis, then the decision is to “fail to reject the null hypothesis.”

The first step in the hypothesis testing process is to write what are called a “null” hypothesis and an “alternative” hypothesis. These hypotheses will be statements regarding an unknown population parameter, such as a population mean or a population proportion. These two hypotheses are based on the claim under study and will be written as complementary statements to each other. When one of the hypotheses is true, the other hypothesis must be false (since they are complements of one another). Note that either the null or alternative hypothesis can represent the claim.

The null hypothesis is labeled as $H_0$ and is a statement of equality. For example, the null hypothesis might state that the mean time to complete an exam is 42 minutes. For this example, the null hypothesis would be written as:

$$H_0: \mu = 42 \text{ minutes}$$

where the notation $\mu$ refers to the population mean time to complete an exam.
The null hypothesis will always contain one of the following three symbols: $=$, $\le$, or $\ge$.

The alternative hypothesis is labeled as $H_a$ and will be the complement of the null hypothesis. For example, if the null hypothesis contains an equals symbol, then the alternative hypothesis will contain a not equals symbol.

If the null hypothesis contains a $\ge$ symbol, then the alternative hypothesis will contain a $<$ symbol.
If the null hypothesis contains a $\le$ symbol, then the alternative hypothesis will contain a $>$ symbol.

Note that the alternative hypothesis will always contain one of the following three symbols: $\ne$, $<$, or $>$.

To write the null and alternative hypotheses, start off by translating the claim into a mathematical statement involving the unknown population parameter. Depending on the wording, the claim can be placed in either the null or the alternative hypothesis. Then write the complement of this statement as the other hypothesis (see Table 4.6).

When translating claims about a population mean, use the symbol $\mu$ as part of the null and alternative hypotheses. When translating claims about a population proportion, use the symbol $p$ as part of the null and alternative hypotheses.

Table 4.6 provides the three possible setups for the null and alternative hypotheses when conducting a one-sample hypothesis test for a population mean. Notice that for each setup, the null and alternative hypotheses are complements of one another. Note that the value of $k$ will be replaced by some numerical constant taken from the stated claim.

Setup A | Setup B | Setup C
$H_0: \mu = k$ | $H_0: \mu \le k$ | $H_0: \mu \ge k$
$H_a: \mu \ne k$ | $H_a: \mu > k$ | $H_a: \mu < k$
Table 4.6 Possible Setups for the Null and Alternative Hypotheses for a Population Mean (One Sample)

Table 4.7 provides the three possible setups for the null and alternative hypotheses when conducting a one-sample hypothesis test for a population proportion. Note that the value of $k$ will be replaced by some numerical constant taken from the stated claim.

Setup D | Setup E | Setup F
$H_0: p = k$ | $H_0: p \le k$ | $H_0: p \ge k$
$H_a: p \ne k$ | $H_a: p > k$ | $H_a: p < k$
Table 4.7 Possible Setups for the Null and Alternative Hypotheses for a Population Proportion (One Sample)

Follow these steps to determine which of these setups to use for a given hypothesis test:

  1. Determine if the hypothesis test involves a claim for a population mean or population proportion. If the hypothesis test involves a population mean, use either Setup A, B, or C.
  2. If the hypothesis test involves a population proportion, use either Setup D, E, or F.
  3. Translate a phrase from the claim into a mathematical symbol and then match this mathematical symbol with one of the setups in Table 4.6 and Table 4.7.

For example, if a claim for a hypothesis test for a mean references the phrase “at least,” note that the phrase “at least” corresponds to a greater than or equals symbol. A greater than or equals symbol appears in Setup C in Table 4.6, so that would be the correct setup to use.

If a claim for a hypothesis test for a proportion references the phrase “different than,” note that the phrase “different than” corresponds to a not equals symbol. A not equals symbol appears in Setup D in Table 4.7, so that would be the correct setup to use.
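The translation from a claim phrase to a pair of hypotheses can be sketched in a few lines of Python. This is only an illustration: the function name `write_hypotheses` and the phrase dictionary are hypothetical, and only the phrases “at least” and “different than” come from the text above; the other entries are standard translations added here as assumptions.

```python
# Map claim phrases to comparison symbols. "at least" and "different than"
# appear in the text; the remaining phrases are assumed standard translations.
PHRASE_TO_SYMBOL = {
    "equal to": "=",
    "different than": "!=",
    "at least": ">=",
    "at most": "<=",
    "less than": "<",
    "greater than": ">",
}

# The complement of each symbol, and the symbols allowed in a null hypothesis.
COMPLEMENT = {"=": "!=", "!=": "=", ">=": "<", "<": ">=", "<=": ">", ">": "<="}
NULL_SYMBOLS = {"=", "<=", ">="}


def write_hypotheses(phrase, k, parameter="mu"):
    """Return (H0, Ha, claim_location) for a claim about 'mu' or 'p'."""
    claim_symbol = PHRASE_TO_SYMBOL[phrase]
    other_symbol = COMPLEMENT[claim_symbol]
    if claim_symbol in NULL_SYMBOLS:
        # The claim contains some form of equality, so it is the null hypothesis.
        null_symbol, alt_symbol, claim_location = claim_symbol, other_symbol, "null"
    else:
        # Otherwise the claim is the alternative hypothesis.
        null_symbol, alt_symbol, claim_location = other_symbol, claim_symbol, "alternative"
    return (
        f"H0: {parameter} {null_symbol} {k}",
        f"Ha: {parameter} {alt_symbol} {k}",
        claim_location,
    )


print(write_hypotheses("at least", 15))
print(write_hypotheses("different than", 0.5, parameter="p"))
```

The third element of the result records whether the claim sits in the null or the alternative hypothesis, which is needed later when the decision is translated into a conclusion.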

Example 4.11

Problem

Write the null and alternative hypotheses for the following claim:
An auto repair facility claims the average time for an oil change is less than 20 minutes.

Example 4.12

Problem

Write the null and alternative hypotheses for the following claim:

A medical researcher claims that the proportion of adults in the United States who are smokers is at most 25%.

The basis of the hypothesis test procedure is to assume that the null hypothesis is true to begin with and then examine the sample data to determine if the sample data is consistent with the null hypothesis. Based on this analysis, you will decide to either reject the null hypothesis or fail to reject the null hypothesis.

It is important to note that these are the only two possible decisions for any hypothesis test, namely:

  • Reject the null hypothesis, or
  • Fail to reject the null hypothesis.

Keep in mind that this decision will be based on statistical analysis of sample data, so there is a small chance of coming to a wrong decision. The only way to be absolutely certain of your decision is to test the entire population, which is typically not feasible. Since the hypothesis test decision is based on a sample, there are two possible errors that can be made by the researcher:

  • Type I error: the researcher rejects the null hypothesis when the null hypothesis is actually true.
  • Type II error: the researcher fails to reject the null hypothesis when the null hypothesis is actually false.

In hypothesis testing, the maximum allowed probability of making a Type I error is called the level of significance, denoted by the Greek letter alpha, $\alpha$. From a practical standpoint, the level of significance is the probability value used to determine when the sample data indicates significant evidence against the null hypothesis. This level of significance is typically set to a small value, which indicates that the researcher wants the probability of rejecting a true null hypothesis to be small. Typical values of the level of significance used in hypothesis testing are as follows: $\alpha = 0.01$, $\alpha = 0.05$, or $\alpha = 0.10$.
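One way to see what the level of significance means is a small simulation. This sketch repeatedly draws samples from a population in which the null hypothesis is true and counts how often a two-tailed z-test (of the kind developed later in this section) rejects it anyway. The population values (mean 42, standard deviation 5), the sample size, and the function name `type_i_error_rate` are illustrative assumptions, not values from the text.

```python
import random
from statistics import NormalDist, mean


def type_i_error_rate(alpha=0.05, trials=2000, n=30, mu0=42.0, sigma=5.0, seed=1):
    """Estimate how often a two-tailed z-test rejects a null hypothesis that is true."""
    rng = random.Random(seed)
    std_normal = NormalDist()  # standard normal distribution
    rejections = 0
    for _ in range(trials):
        # Draw a sample from a population where H0 (mu = mu0) really holds.
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z = (mean(sample) - mu0) / (sigma / n ** 0.5)
        p_value = 2 * std_normal.cdf(-abs(z))
        if p_value <= alpha:
            rejections += 1  # Type I error: H0 is true but was rejected
    return rejections / trials


rate = type_i_error_rate()
print(f"Estimated Type I error rate: {rate:.3f}")
```

Under these assumptions the estimated rate should land close to the chosen value of alpha, which is what it means for alpha to cap the probability of a Type I error.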

Once the null and alternative hypotheses are written and a level of significance is selected, the next step is for the researcher to collect relevant sample data from the population under study. It’s important that a representative random sample is selected, as discussed in Handling Large Datasets. After the sample data is collected, two calculations are performed: test statistic and p-value.

The test statistic is a numerical value used to assess the strength of evidence against a null hypothesis and is calculated from sample data that is used in hypothesis testing. As noted in Estimating Parameters with Confidence Intervals, a sample statistic is a numerical value calculated from a sample of observations drawn from a larger population such as a sample mean or sample proportion. The sample statistic is converted to a standardized test statistic such as a z-score or t-score based on the assumption that the null hypothesis is true.

Once the test statistic has been determined, a probability, or p-value, is calculated. A p-value is the probability of obtaining a sample statistic with a value as extreme as (or more extreme than) the value determined by the sample data under the assumption that the null hypothesis is true. To calculate a p-value, we will find the area under the relevant probability distribution, such as the area under the normal distribution curve or the area under the t-distribution curve. Since the p-value is a probability, any p-value must always be a numerical value between 0 and 1 inclusive.

When calculating a p-value as the area under the probability distribution curve, the corresponding area will be determined using the location under the curve, which favors the rejection of the null hypothesis as follows:

  1. If the alternative hypothesis contains a “less than” symbol, the hypothesis test is called a “left tailed” test and the p-value will be calculated as the area to the left of the test statistic.
  2. If the alternative hypothesis contains a “greater than” symbol, the hypothesis test is called a “right tailed” test and the p-value will be calculated as the area to the right of the test statistic.
  3. If the alternative hypothesis contains a “not equals” symbol, the hypothesis test is called a “two tailed” test and the p-value will be calculated as the sum of the area to the left of the negative test statistic and the area to the right of the positive test statistic.
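The three cases above can be sketched for a z test statistic using the standard library's `NormalDist`. The helper name `p_value_from_z` and the example value of z are hypothetical, chosen only to show how the same test statistic gives different p-values depending on the tail.

```python
from statistics import NormalDist


def p_value_from_z(z, tail):
    """Return the p-value for a z test statistic; tail is 'left', 'right', or 'two'."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "left":
        return std_normal.cdf(z)            # area to the left of z
    if tail == "right":
        return 1 - std_normal.cdf(z)        # area to the right of z
    if tail == "two":
        return 2 * std_normal.cdf(-abs(z))  # area in both tails beyond |z|
    raise ValueError("tail must be 'left', 'right', or 'two'")


for tail in ("left", "right", "two"):
    print(tail, round(p_value_from_z(-1.5, tail), 4))
```

For a t test statistic the same logic applies, with the t-distribution's cumulative distribution function in place of the normal one.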

The smaller the p-value, the more the sample data deviates from the null hypothesis, and this is more evidence to indicate that the null hypothesis should be rejected.

The final step in the hypothesis testing procedure is to come to a final decision regarding the null hypothesis. This is accomplished by comparing the p-value with the level of significance and applying the following rule:

  • If the p-value $\le$ level of significance, then the decision is to “reject the null hypothesis.”
  • If the p-value $>$ level of significance, then the decision is to “fail to reject the null hypothesis.”

Recall that these are the only two possible decisions that a researcher can reach when conducting a hypothesis test.

Often, we would like to translate these decisions into a conclusion that is easier to interpret for someone without a statistical background. For example, a decision to “fail to reject the null hypothesis” might be difficult to interpret for those not familiar with the terminology of hypothesis testing. These decisions can be translated into concluding statements such as those shown in Table 4.8.

Row | Select the decision from the hypothesis test: | Is the claim in the null or the alternative hypothesis? | Then, use this concluding statement:
1 | Reject the null hypothesis | Claim is the null hypothesis | There is enough evidence to reject the claim.
2 | Reject the null hypothesis | Claim is the alternative hypothesis | There is enough evidence to support the claim.
3 | Fail to reject the null hypothesis | Claim is the null hypothesis | There is not enough evidence to reject the claim.
4 | Fail to reject the null hypothesis | Claim is the alternative hypothesis | There is not enough evidence to support the claim.
Table 4.8 How to Establish a Conclusion from a Hypothesis Test
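The decision rule and the lookup in Table 4.8 can be combined in a short sketch. The names `decide`, `conclude`, and `CONCLUSIONS` are hypothetical; the concluding statements are copied from Table 4.8.

```python
def decide(p_value, alpha):
    """Apply the decision rule: reject H0 when the p-value is at most alpha."""
    return "reject" if p_value <= alpha else "fail to reject"


# Concluding statements from Table 4.8, keyed by (decision, location of the claim).
CONCLUSIONS = {
    ("reject", "null"): "There is enough evidence to reject the claim.",
    ("reject", "alternative"): "There is enough evidence to support the claim.",
    ("fail to reject", "null"): "There is not enough evidence to reject the claim.",
    ("fail to reject", "alternative"): "There is not enough evidence to support the claim.",
}


def conclude(p_value, alpha, claim_location):
    """Return the decision and the plain-language conclusion for a hypothesis test."""
    decision = decide(p_value, alpha)
    return decision, CONCLUSIONS[(decision, claim_location)]


print(conclude(0.03, 0.05, "alternative"))
print(conclude(0.20, 0.05, "null"))
```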

Testing Claims for the Mean When the Population Standard Deviation Is Known

Recall that in the discussion for confidence intervals we examined the confidence interval for the population mean when the population standard deviation is known (normal distribution is used) or when the population standard deviation is unknown (t-distribution is used). We follow the same procedure when conducting hypothesis tests.

In this section, we discuss hypothesis testing for the mean when the population standard deviation is known. Although the population standard deviation is typically unknown, in some cases a reasonable estimate for the population standard deviation can be obtained from past studies or historical data.

Here are the requirements to use this procedure:

  1. A random sample is selected from the population.
  2. The sample size is at least 30, or the underlying population is known to follow a normal distribution.
  3. The population standard deviation ($\sigma$) is known.

When these requirements are met, the normal distribution is used as the basis for conducting the hypothesis test.

For this procedure, the test statistic will be a z-score and the standardized test statistic is calculated as follows:

$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$

Where:
$\bar{x}$ is the sample mean.
$\mu$ is the hypothesized value of the population mean, which comes from the null hypothesis.
$\sigma$ is the population standard deviation.
$n$ is the sample size.

Example 4.13 provides a step-by-step example to illustrate the hypothesis testing process.

Example 4.13

Problem

A random sample of 100 college students finds that students spend an average of 14.4 hours per week on social media. Use this sample data to test the claim that college students spend at least 15 hours per week on social media. Use a level of significance of 0.05 and assume the population standard deviation is 3.25 hours.
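As a rough check on the calculations for this problem, the following sketch computes the z test statistic and p-value with the standard library. It assumes the claim “at least 15 hours” is written as the null hypothesis $H_0: \mu \ge 15$ with alternative $H_a: \mu < 15$ (a left-tailed test); the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Values given in the problem statement.
x_bar, mu_0, sigma, n, alpha = 14.4, 15, 3.25, 100, 0.05

# Assumed setup: H0: mu >= 15 (the claim), Ha: mu < 15, so the test is left-tailed.
z = (x_bar - mu_0) / (sigma / sqrt(n))
p_value = NormalDist().cdf(z)  # area to the left of z

decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(f"z = {z:.3f}, p-value = {p_value:.4f}, decision: {decision}")
```

Under this setup the p-value falls below 0.05, so the null hypothesis is rejected; because the claim is the null hypothesis, Table 4.8 translates this to "There is enough evidence to reject the claim."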

Testing Claims for the Mean When the Population Standard Deviation Is Unknown

In this section, we discuss one sample hypothesis testing for the mean when the population standard deviation is unknown. This is a more common application of hypothesis testing for the mean in that the population standard deviation is typically unknown; however, the sample standard deviation can be calculated based on the sample data.

Here are the requirements to use this procedure:

  1. A random sample is selected from the population.
  2. The sample size is at least 30, or the underlying population is known to follow a normal distribution.
  3. The population standard deviation ($\sigma$) is unknown; the sample standard deviation ($s$) is known.

When these requirements are met, the t-distribution is used as the basis for conducting the hypothesis test.

For this procedure, the test statistic will be a t-score and the standardized test statistic is calculated as follows:

$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$

Where:
$\bar{x}$ is the sample mean.
$\mu$ is the hypothesized value of the population mean, which comes from the null hypothesis.
$s$ is the sample standard deviation.
$n$ is the sample size.

Example 4.14 provides a step-by-step example to illustrate this process.

Example 4.14

Problem

A smartphone manufacturer claims that the mean battery life of its latest smartphone model is 25 hours. A consumer group decides to test this hypothesis and collects a sample of 50 smartphones and determines that the mean battery life of the sample is 24.1 hours with a sample standard deviation of 4.1 hours. Use this sample data to test the claim made by the smartphone manufacturer. Use a level of significance of 0.10 for this analysis.
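The calculation for this problem can be sketched from the summary statistics, using `scipy.stats.t` for the area under the t-distribution. The sketch assumes the claim “mean battery life is 25 hours” is the null hypothesis $H_0: \mu = 25$ with alternative $H_a: \mu \ne 25$ (a two-tailed test) and uses $n - 1$ degrees of freedom; the variable names are illustrative.

```python
from math import sqrt
from scipy import stats

# Values given in the problem statement.
x_bar, mu_0, s, n, alpha = 24.1, 25, 4.1, 50, 0.10

# Assumed setup: H0: mu = 25 (the claim), Ha: mu != 25, so the test is two-tailed.
t_stat = (x_bar - mu_0) / (s / sqrt(n))
df = n - 1  # degrees of freedom for a one-sample t-test
p_value = 2 * stats.t.cdf(-abs(t_stat), df)  # area in both tails

decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(f"t = {t_stat:.3f}, df = {df}, p-value = {p_value:.4f}, decision: {decision}")
```

Under this setup the p-value is larger than 0.10, so the decision is to fail to reject the null hypothesis; with the claim in the null hypothesis, there is not enough evidence to reject the manufacturer's claim.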

Testing Claims for the Proportion

In this section, we discuss one sample hypothesis testing for proportions. Recall that a sample proportion is measuring “how many out of the total.” For example, a medical researcher might be interested in testing a claim that the proportion of patients experiencing side effects from a new blood pressure–lowering medication is less than 4%.

Here are the requirements to use this procedure:

  1. A random sample is selected from the population.
  2. Verify that the normal approximation to the binomial distribution is appropriate by ensuring that $n\hat{p}$ and $n(1 - \hat{p})$ are both at least 5.

The sample proportion $\hat{p}$ is calculated as the number of successes divided by the sample size:

$$\hat{p} = \frac{x}{n}$$

When these requirements are met, the normal distribution is used as the basis for conducting the hypothesis test.

For this procedure, the test statistic will be a z-score and the standardized test statistic is calculated as follows:

$$z = \frac{\hat{p} - p}{\sqrt{p(1 - p)/n}}$$

Where:
$\hat{p}$ represents the sample proportion.
$p$ is the hypothesized value of the population proportion, which is taken from the null hypothesis.
$n$ is the sample size.

Here is a step-by-step example to illustrate this hypothesis testing process.

Example 4.15

Problem

A college professor claims that the proportion of students using artificial intelligence (AI) tools as part of their coursework is less than 45%. To test the claim, the professor selects a sample of 200 students and surveys the students to determine if they use AI tools as part of their coursework. The results from the survey indicate that 74 out of 200 students use AI tools. Use this sample data to test the claim made by the college professor. Use a level of significance of 0.05 for this analysis.
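The calculation for this problem can be sketched with the standard library. The sketch assumes the claim “less than 45%” is the alternative hypothesis $H_a: p < 0.45$ with null hypothesis $H_0: p \ge 0.45$ (a left-tailed test); the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Values given in the problem statement.
x, n, p_0, alpha = 74, 200, 0.45, 0.05
p_hat = x / n  # sample proportion

# Requirement check from the text: n*p_hat and n*(1 - p_hat) are both at least 5.
requirements_met = n * p_hat >= 5 and n * (1 - p_hat) >= 5

# Assumed setup: H0: p >= 0.45, Ha: p < 0.45 (the claim), so the test is left-tailed.
z = (p_hat - p_0) / sqrt(p_0 * (1 - p_0) / n)
p_value = NormalDist().cdf(z)  # area to the left of z

decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(f"requirements met: {requirements_met}")
print(f"p_hat = {p_hat:.2f}, z = {z:.3f}, p-value = {p_value:.4f}, decision: {decision}")
```

Under this setup the p-value is below 0.05, so the null hypothesis is rejected; with the claim in the alternative hypothesis, there is enough evidence to support the professor's claim.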

Using Python to Conduct Hypothesis Tests

Python libraries provide several functions to assist with hypothesis testing. For example, ztest() (available in the statsmodels library) can be used for hypothesis testing for the mean when the population standard deviation is known, and ttest_1samp() (available in scipy.stats) can be used for hypothesis testing for the mean when the population standard deviation is unknown. See the example:

Example 4.16

Problem

A medical researcher wants to test a claim that the average oxygen level for females is greater than 75 mm Hg. In order to test the claim, a sample of 15 female patients is selected and the following oxygen levels are recorded:

76, 75, 74, 81, 76, 77, 71, 74, 73, 79

Use the ttest_1samp() function in Python to conduct a hypothesis test to test the claim. Assume a level of significance of 0.05.
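A minimal sketch of this test with `scipy.stats.ttest_1samp` follows. Note that the problem statement mentions 15 patients but lists only ten values, so the sketch uses the ten values shown and makes no attempt to supply the rest. It assumes the claim “greater than 75” is the alternative hypothesis $H_a: \mu > 75$ with null hypothesis $H_0: \mu \le 75$ (a right-tailed test), which is requested through the `alternative="greater"` argument.

```python
from scipy import stats

# The ten oxygen levels (mm Hg) listed in the problem statement.
oxygen_levels = [76, 75, 74, 81, 76, 77, 71, 74, 73, 79]
alpha = 0.05

# Assumed setup: H0: mu <= 75, Ha: mu > 75 (the claim), so the test is right-tailed.
result = stats.ttest_1samp(oxygen_levels, popmean=75, alternative="greater")
t_stat, p_value = result.statistic, result.pvalue

decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}, decision: {decision}")
```

For these ten values the p-value is well above 0.05, so the decision is to fail to reject the null hypothesis; with the claim in the alternative hypothesis, there is not enough evidence to support the claim. The result would differ if the full sample of 15 values were available.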

Hypothesis Testing for Two Samples

In the previous section, we examined methods to conduct hypothesis testing for one sample. Researchers and data scientists are also often interested in conducting hypothesis tests for two samples. A researcher might want to compare two means or two proportions to determine if two groups are significantly different from one another. For example, a medical researcher might be interested in comparing the mean blood pressure between two groups of individuals: one group that is taking a blood pressure–lowering medication and another group that is taking a placebo. A government researcher might be interested in comparing the proportion of smokers for male versus female adults.

The same general hypothesis testing procedure will be used for two samples, but the setup of the null and alternative hypotheses will be structured to reflect the comparison of two means or two proportions.

As a reminder, the general hypothesis testing method introduced in the previous section includes the following steps:

  1. Set up a null and an alternative hypothesis based on the claim.
  2. Collect relevant sample data to investigate the claim.
  3. Determine the correct distribution to be used to analyze the hypothesis test (e.g., normal distribution or t-distribution).
  4. Analyze the sample data to determine if the sample data is consistent with the null hypothesis. This will involve statistical calculations involving a test statistic and a p-value, which are discussed next.
  5. If the sample data is inconsistent with the null hypothesis, then the decision is to “reject the null hypothesis.” If the sample data is consistent with the null hypothesis, then the decision is to “fail to reject the null hypothesis.”

As discussed for one-sample hypothesis testing, when writing the null and alternative hypotheses, start off by translating the claim into a mathematical statement involving the unknown population parameters. The claim could correspond to either the null or the alternative hypothesis. Then write the complement of this statement in order to write the other hypothesis. The difference now as compared to the previous section is that the hypotheses will be comparing two means or two proportions instead of just one mean or one proportion.

When translating claims about population means, use the symbols $\mu_1$ and $\mu_2$ as part of the null and alternative hypotheses to represent the two unknown population means. When translating claims about population proportions, use the symbols $p_1$ and $p_2$ as part of the null and alternative hypotheses to represent the two unknown population proportions. Refer to Table 4.9 and Table 4.10 for the various setups to construct the null and alternative hypotheses for two-sample hypothesis testing.

Setup A | Setup B | Setup C
$H_0: \mu_1 = \mu_2$ | $H_0: \mu_1 \le \mu_2$ | $H_0: \mu_1 \ge \mu_2$
$H_a: \mu_1 \ne \mu_2$ | $H_a: \mu_1 > \mu_2$ | $H_a: \mu_1 < \mu_2$
Table 4.9 Three Possible Setups for the Null and Alternative Hypotheses for a Population Mean (Two Samples)

Setup D | Setup E | Setup F
$H_0: p_1 = p_2$ | $H_0: p_1 \le p_2$ | $H_0: p_1 \ge p_2$
$H_a: p_1 \ne p_2$ | $H_a: p_1 > p_2$ | $H_a: p_1 < p_2$
Table 4.10 Three Possible Setups for the Null and Alternative Hypotheses for a Population Proportion (Two Samples)

Comparing Two Means

In this section, we discuss two-sample hypothesis testing for the mean when the population standard deviations are unknown. Typically, in real-world applications, population standard deviations are unknown, so this method has considerable practical appeal.

One other consideration when conducting hypothesis testing for two samples is to determine if the two samples are dependent or independent. Two samples are considered to be independent samples when the sample values from one population are not related to the sample values taken from the second population. In other words, there is not an inherent paired or matched relationship between the two samples. Two samples are considered to be dependent samples if the samples from one population can be paired or matched to the samples taken from the second population. Dependent samples are sometimes also called paired samples.

For example, if a researcher were to collect data for the ages for a random sample of 15 men and a random sample of 15 women, there would be no inherent relationship between these two samples, and so these samples are considered independent samples.

On the other hand, let’s say a researcher were to collect data on the ages of a random sample of 15 adults and also collect data on the ages of those adults’ corresponding (15) spouses/partners; these would be considered dependent samples in that there is an inherent pairing of domestic partners.

Example 4.17

Problem

Determine if the following samples are dependent or independent samples:

Sample #1: Scores on a statistics exam for 40 women college students

Sample #2: Scores on a statistics exam for 35 men college students

Example 4.18

Problem

Determine if the following samples are dependent or independent samples:

Sample #1: Blood pressure readings on a sample of 50 patients in a clinical trial before taking any medication. Sample #2: Blood pressure readings on the same group of 50 patients after taking a blood pressure–lowering medication.

Here are the requirements to use this procedure:

  1. The samples are random and independent.
  2. For each sample, the sample sizes are both at least 30, or the underlying populations are known to follow a normal distribution.
  3. The population standard deviations ($\sigma_1$, $\sigma_2$) are unknown; the sample standard deviations ($s_1$, $s_2$) are known.

When these requirements are met, the t-distribution is used as the basis for conducting the hypothesis test.

In this discussion, we assume that the population variances are not equal, which leads to the following formula for the test statistic.

For this procedure, the test statistic will be a t-score and the standardized test statistic is calculated as follows:

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

Where:

$\bar{x}_1$, $\bar{x}_2$ are the sample means.

$\mu_1$, $\mu_2$ are the hypothesized values of the population means.

$s_1$, $s_2$ are the sample standard deviations.

$n_1$, $n_2$ are the sample sizes.

Note: For the t-distribution, the degrees of freedom will be the smaller of $n_1 - 1$ and $n_2 - 1$.

Example 4.19 illustrates this hypothesis testing process.

Example 4.19

Problem

A software engineer at a communications company claims that the average salary of women engineers is less than that of men engineers. To test the claim, a human resource administrator selects a random sample of 45 women engineers and 50 men engineers and collects statistical data as shown in Table 4.11. (Dollar amounts are in thousands of dollars.)

Statistic | Women Engineers | Men Engineers
Sample Mean | $\bar{x}_1 = 72700$ | $\bar{x}_2 = 76900$
Sample Standard Deviation | $s_1 = 9800$ | $s_2 = 10700$
Sample Size | $n_1 = 45$ | $n_2 = 50$
Table 4.11 Sample Data for Women and Men Engineers

Use this sample data to test the claim made by the software engineer. Use a level of significance of 0.05 for this analysis.
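The calculation for this problem can be sketched from the summary statistics in Table 4.11. The sketch assumes the claim is the alternative hypothesis $H_a: \mu_1 < \mu_2$ with null hypothesis $H_0: \mu_1 \ge \mu_2$ (a left-tailed test with a hypothesized difference of 0) and uses the degrees of freedom rule stated above, the smaller of $n_1 - 1$ and $n_2 - 1$. The variable names are illustrative.

```python
from math import sqrt
from scipy import stats

# Summary statistics from Table 4.11.
x1, s1, n1 = 72700, 9800, 45    # women engineers
x2, s2, n2 = 76900, 10700, 50   # men engineers
alpha = 0.05

# Assumed setup: H0: mu1 >= mu2, Ha: mu1 < mu2 (the claim), so the test is left-tailed.
hypothesized_difference = 0
t_stat = ((x1 - x2) - hypothesized_difference) / sqrt(s1**2 / n1 + s2**2 / n2)
df = min(n1 - 1, n2 - 1)  # conservative rule from the text
p_value = stats.t.cdf(t_stat, df)  # area to the left of t

decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(f"t = {t_stat:.3f}, df = {df}, p-value = {p_value:.4f}, decision: {decision}")
```

Under this setup the p-value is below 0.05, so the null hypothesis is rejected and there is enough evidence to support the claim. A library routine such as `scipy.stats.ttest_ind_from_stats` with `equal_var=False` computes the degrees of freedom with the Welch-Satterthwaite approximation instead, so its p-value may differ slightly from the one produced by the rule used here.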

Comparing Matched Pairs Data

Hypothesis Testing for Two Samples discussed hypothesis testing for two independent samples. In this section, we turn our attention to hypothesis testing for dependent samples, often called matched pairs, paired samples, or paired data.

Recall that two samples are considered to be dependent if the samples from one population can be paired, or matched, to the samples taken from the second population.

This analysis method is often used to analyze before/after data to determine if a significant change has occurred. In a clinical trial, a medical researcher may be interested in comparing before and after blood pressure readings on a patient to assess the efficacy of a blood pressure–lowering medication. Or an employer might compare scores on a safety exam for a sample of employees before and after employees have taken a course on safety procedures and practices. A parent might want to compare before/after SAT scores for students who have taken an SAT preparation course. This type of analysis can help to assess the effectiveness and value of such training.

When conducting a hypothesis test for paired data such as before/after data, a key step will be to calculate the “difference” for each pair of data values, as follows:

$$\text{Difference } (d) = (\text{before data}) - (\text{after data})$$

Note that when calculating these differences, the result can be either a positive or a negative value.

Here are the requirements to use this procedure:

  1. The samples are random and dependent (i.e., paired data).
  2. The populations are normally distributed or the number of data pairs is at least 30.

When these requirements are met, the t-distribution is used as the basis for conducting the hypothesis test.

The test statistic will be the average of the differences across the data pairs (labeled as $\bar{d}$), where:

$$\bar{d} = \frac{\sum d}{n}$$

For this procedure, the standardized test statistic is calculated as follows:

$$t = \frac{\bar{d} - \mu_d}{s_d / \sqrt{n}}$$

Where:

$\bar{d}$ is the mean of the differences between paired data.

$\mu_d$ is the hypothesized population mean of the differences.

$s_d$ is the standard deviation of the differences between paired data.

$n$ is the number of data pairs.

Note: For the t-distribution, the degrees of freedom will be $n - 1$, where $n$ is the number of data pairs.

The setup of the null and alternative hypotheses will follow one of the setups shown in Table 4.12:

Setup A | Setup B | Setup C
$H_0: \mu_d = k$ | $H_0: \mu_d \le k$ | $H_0: \mu_d \ge k$
$H_a: \mu_d \ne k$ | $H_a: \mu_d > k$ | $H_a: \mu_d < k$
Table 4.12 Possible Setups for Null and Alternative Hypothesis for Paired Data Hypothesis Testing

Example 4.20 illustrates this hypothesis testing process.

Example 4.20

Problem

A pharmaceutical company is testing a cholesterol-lowering medication in a clinical trial. The company claims that the medication helps to lower cholesterol levels. To test the claim, a group of 10 patients is selected for a clinical trial. Before and after cholesterol measurements are taken on the same patients, where the “after” measurement is recorded after the patient has taken the medication for a six-month time period as shown in Table 4.13.

Patient | Cholesterol Level Before Taking the Medication | Cholesterol Level After Taking the Medication | Difference (Before Reading - After Reading)
1 | 218 | 210 | 8
2 | 232 | 241 | -9
3 | 259 | 223 | 36
4 | 265 | 244 | 21
5 | 248 | 227 | 21
6 | 298 | 273 | 25
7 | 263 | 252 | 11
8 | 281 | 276 | 5
9 | 290 | 281 | 9
10 | 271 | 259 | 12
$\bar{d} = 13.9$, $s_d = 12.414$
Table 4.13 Sample Data for Paired Data Example
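The paired-data calculation can be sketched with `scipy.stats.ttest_rel`, using the before and after readings from Table 4.13. The sketch assumes the claim “the medication lowers cholesterol” is the alternative hypothesis $H_a: \mu_d > 0$ with null hypothesis $H_0: \mu_d \le 0$, where each difference is before minus after. The problem statement as given does not state a level of significance, so 0.05 is used here purely as an assumption.

```python
from statistics import mean, stdev
from scipy import stats

# Before and after cholesterol readings from Table 4.13.
before = [218, 232, 259, 265, 248, 298, 263, 281, 290, 271]
after = [210, 241, 223, 244, 227, 273, 252, 276, 281, 259]
alpha = 0.05  # assumed; not stated in the problem

# Differences (before - after), with their mean and sample standard deviation.
differences = [b - a for b, a in zip(before, after)]
d_bar, s_d = mean(differences), stdev(differences)

# Assumed setup: H0: mu_d <= 0, Ha: mu_d > 0 (the claim), so the test is right-tailed.
result = stats.ttest_rel(before, after, alternative="greater")
t_stat, p_value = result.statistic, result.pvalue

decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(f"d_bar = {d_bar:.1f}, s_d = {s_d:.3f}")
print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}, decision: {decision}")
```

The computed mean and standard deviation of the differences agree with the values in the last row of Table 4.13. Under the assumed level of significance the p-value is small enough to reject the null hypothesis, which supports the claim that the medication lowers cholesterol levels.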

Testing Claims for Two Proportions

Data scientists and researchers are often interested in comparing two proportions. For example, a medical researcher might be interested in comparing the percentage of people contracting a respiratory virus for those taking a vaccine versus not taking a vaccine.

In this section, we discuss two-sample hypothesis testing for proportions, where the notation $p_1$ refers to the population proportion for the first group and $p_2$ refers to the population proportion for the second group.

A researcher will take samples from the two populations and then calculate two sample proportions ($\hat{p}_1$ and $\hat{p}_2$) as follows:

$$\hat{p}_1 = \frac{x_1}{n_1} \qquad \hat{p}_2 = \frac{x_2}{n_2}$$

Where:
$x_1$ is the number of successes in the first sample.
$n_1$ is the sample size for the first sample.

$x_2$ is the number of successes in the second sample.
$n_2$ is the sample size for the second sample.

$\hat{p}_1$ is the sample proportion for the first sample.
$\hat{p}_2$ is the sample proportion for the second sample.

It is also useful to calculate a weighted estimate ($\bar{p}$) for $\hat{p}_1$ and $\hat{p}_2$, as follows:

$$\bar{p} = \frac{x_1 + x_2}{n_1 + n_2}$$

Here are the requirements to use this procedure:

  1. The samples are random and independent.
  2. Sample sizes are large enough to use a normal distribution as an approximation for a binomial distribution. To confirm this, ensure that the following four quantities are all at least 5: $n_1\bar{p}$, $n_1(1 - \bar{p})$, $n_2\bar{p}$, $n_2(1 - \bar{p})$.

When these requirements are met, the normal distribution is used as the basis for conducting the hypothesis test.

For this procedure, the test statistic will be a z-score, and the standardized test statistic is calculated as follows:

$$z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\bar{p}(1 - \bar{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

Here is a step-by-step example to illustrate this hypothesis testing process.

Example 4.21

Problem

A drug is under investigation to address lung cancer. As part of a clinical trial, 250 people took the drug and 230 people took a placebo (a substance with no therapeutic effect). From the results of the study, 211 people taking the drug were cancer free, and 172 taking the placebo were cancer free. At a level of significance of 0.01, can you support the claim by the pharmaceutical company that the proportion of subjects who are cancer free after taking the actual cancer drug is greater than the proportion for those who took the placebo?
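The calculation for this problem can be sketched with the standard library. The sketch assumes the claim is the alternative hypothesis $H_a: p_1 > p_2$ with null hypothesis $H_0: p_1 \le p_2$ (a right-tailed test with a hypothesized difference of 0), where group 1 took the drug and group 2 took the placebo. The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Values given in the problem statement.
x1, n1 = 211, 250   # cancer free / total, drug group
x2, n2 = 172, 230   # cancer free / total, placebo group
alpha = 0.01

p_hat1, p_hat2 = x1 / n1, x2 / n2
p_bar = (x1 + x2) / (n1 + n2)  # weighted (pooled) estimate

# Requirement check from the text: all four quantities are at least 5.
requirements_met = min(n1 * p_bar, n1 * (1 - p_bar),
                       n2 * p_bar, n2 * (1 - p_bar)) >= 5

# Assumed setup: H0: p1 <= p2, Ha: p1 > p2 (the claim), so the test is right-tailed.
hypothesized_difference = 0
z = ((p_hat1 - p_hat2) - hypothesized_difference) / sqrt(
    p_bar * (1 - p_bar) * (1 / n1 + 1 / n2)
)
p_value = 1 - NormalDist().cdf(z)  # area to the right of z

decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(f"requirements met: {requirements_met}")
print(f"z = {z:.3f}, p-value = {p_value:.4f}, decision: {decision}")
```

Under this setup the p-value is below 0.01, so the null hypothesis is rejected; with the claim in the alternative hypothesis, there is enough evidence to support the claim that the cancer-free proportion is greater for the drug than for the placebo.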

Citation/Attribution

This book may not be used in the training of large language models or otherwise be ingested into large language models or generative AI offerings without OpenStax's permission.

Want to cite, share, or modify this book? This book uses the Creative Commons Attribution-NonCommercial-ShareAlike License and you must attribute OpenStax.

Attribution information
  • If you are redistributing all or part of this book in a print format, then you must include on every physical page the following attribution:
    Access for free at https://openstax.org/books/principles-data-science/pages/1-introduction
  • If you are redistributing all or part of this book in a digital format, then you must include on every digital page view the following attribution:
    Access for free at https://openstax.org/books/principles-data-science/pages/1-introduction
Citation information

© Dec 19, 2024 OpenStax. Textbook content produced by OpenStax is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License . The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, and OpenStax CNX logo are not subject to the Creative Commons license and may not be reproduced without the prior and express written consent of Rice University.