Principles of Data Science

4.1 Statistical Inference and Confidence Intervals

Learning Outcomes

By the end of this section, you should be able to:

  • 4.1.1 Estimate parameters, create confidence intervals, and calculate sample size requirements.
  • 4.1.2 Apply bootstrapping methods for parameter estimation.
  • 4.1.3 Use Python to calculate confidence intervals and conduct hypothesis tests.

Data scientists who want to infer the value of a population parameter, such as a population mean or a population proportion, turn to inferential statistics. A data scientist is often interested in making generalizations about a population based on the characteristics of a sample; inferential statistics allows a data scientist to draw conclusions about a population based on sample data. In addition, data scientists use inferential statistics to assess model performance and compare different algorithms in machine learning applications. Inferential statistics also provides methods for building predictive forecasting models, which allow data scientists to generate predictions and identify trends that support effective and accurate decision-making. In this section, we explore confidence intervals, which are used extensively in inferential statistical analysis.

We begin by introducing confidence intervals, which are used to estimate the range within which a population parameter is likely to fall. We discuss estimation of parameters for the mean both when the standard deviation is known and when it is not known. We discuss sample size determination, building on the sampling techniques presented in Collecting and Preparing Data. And we discuss bootstrapping, a method used to construct a confidence interval based on repeated sampling. Hypothesis Testing will move on to hypothesis testing, which is used to make inferences about unknown parameters.

Estimating Parameters with Confidence Intervals

A point estimate is a single value that is used to estimate a population parameter. For example, a sample mean is a point estimate of the unknown population mean. When researchers collect data from a sample to make inferences about a population, they calculate a point estimate based on the observed sample data. (See Survey Design and Implementation for coverage of sampling techniques.) The point estimate serves as the best guess or approximation for the parameter's actual value.

A confidence interval estimates the range within which a population parameter, such as a mean or a proportion, is likely to fall. The confidence interval conveys the uncertainty associated with the estimate and is expressed as a range of values. A confidence interval provides both a lower and an upper bound for the population parameter, with the point estimate centered within the interval.

Table 4.1 describes the point estimates and corresponding population parameters for the mean and proportion.

| Quantity | Population Parameter | Point Estimate |
|---|---|---|
| Mean | Population mean is denoted as \(\mu\). | Point estimate is the sample mean \(\bar{x}\). |
| Proportion | Population proportion is denoted as \(p\). | Point estimate is the sample proportion \(\hat{p}\). |
Table 4.1 Population Parameters and Point Estimates for Mean and Proportion

Let’s say a researcher is interested in estimating the mean income for all residents of California. Since it is not feasible to collect data from every resident of California, the researcher selects a random sample of 1,000 residents and calculates the sample mean income for these 1,000 people. This average income from the sample then provides an estimate for the population mean income of all California residents. The sample mean is chosen as the point estimate for the population mean because it is an unbiased estimator of the population mean. In the same way, the sample proportion is an unbiased estimator of the population proportion. An unbiased estimator is a statistic that does not systematically overestimate or underestimate the corresponding population parameter.

In order to calculate a confidence interval, two quantities are needed:

  • The point estimate
  • The margin of error

As noted, the point estimate is a single number that is used to estimate the population parameter. The margin of error (usually denoted by E) indicates the maximum error of the estimate. It can be viewed as the maximum distance from the point estimate within which the population parameter is expected to lie at a specified confidence level.

Once the point estimate and margin of error are determined, the confidence interval is calculated as follows:

Lower Bound of the Confidence Interval = Point Estimate − Margin of Error
Upper Bound of the Confidence Interval = Point Estimate + Margin of Error

The margin of error reflects the desired confidence level of the researcher. The confidence level is the proportion of interval estimates that would contain the population parameter if the estimation process were repeated over and over. Confidence levels typically range from 80% to 99%.

In addition to the confidence level, both the variability of the sample and the sample size will affect the margin of error.

Once the confidence interval has been calculated, a concluding statement is made that reflects both the confidence level and the unknown population parameter. The concluding statement indicates that there is a certain level of confidence that the population parameter is contained within the bounds of the confidence interval.

As an example, consider a data scientist who is interested in forecasting the median income for all residents of California. Income data is collected from a sample of 1,000 residents, and the median income level is $68,500. Also assume the margin of error for a 95% confidence interval is $4,500 (more details on the calculation of margin of error are provided next). Now the data scientist can construct a 95% confidence interval to forecast income levels as follows:

Lower Bound of the Confidence Interval = Point Estimate − Margin of Error = $68,500 − $4,500 = $64,000
Upper Bound of the Confidence Interval = Point Estimate + Margin of Error = $68,500 + $4,500 = $73,000

The data scientist can then state the following conclusion: There is 95% confidence that the forecast for median income for all residents of California is between $64,000 and $73,000.
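The bound calculation above can be sketched in a few lines of Python. The helper name `confidence_interval` is an illustrative choice rather than a library function; its inputs are the point estimate and margin of error from the income example.

```python
def confidence_interval(point_estimate, margin_of_error):
    """Return (lower, upper) bounds: point estimate minus/plus margin of error."""
    lower = point_estimate - margin_of_error
    upper = point_estimate + margin_of_error
    return lower, upper

# Values taken from the California income example in the text
lower, upper = confidence_interval(68_500, 4_500)
print(f"95% confidence interval: ${lower:,} to ${upper:,}")
```

The same helper works for any point estimate (mean, median, or proportion) once its margin of error is known.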

Example 4.1

Problem

A medical researcher is interested in estimating the population mean age for patients suffering from arthritis. A sample of 100 patients is taken, and the sample mean is determined to be 64 years old. Assume that the corresponding margin of error for a 95% confidence interval is calculated to be 4 years.

Calculate the confidence interval and provide a conclusion regarding the confidence interval.

Sampling Distribution for the Mean

A researcher takes repeated samples of size 1,000 from the residents of New York to collect data on mean income of residents of New York.

For each sample of size 1,000, we can calculate a sample mean, \(\bar{x}\). If the researcher were to take 50 such samples (each of sample size 1,000), a series of sample means can be calculated:

\[ \bar{x}_1, \bar{x}_2, \bar{x}_3, \ldots, \bar{x}_{50} \]

A probability distribution based on all possible random samples of a certain size from a population—or sampling (or sample) distribution—can then be analyzed. For example, we can calculate the mean of these sample means, and we can calculate the standard deviation of these sample means.

There are two important properties of the sample distribution of these sample means:

  1. The mean of the sample means (notated as \(\mu_{\bar{x}}\)) is equal to the population mean \(\mu\).
    1. Written mathematically: \(\mu_{\bar{x}} = \mu\)
  2. The standard deviation of the sample means (notated as \(\sigma_{\bar{x}}\)) is equal to the population standard deviation \(\sigma\) divided by the square root of the sample size \(n\).
    1. Written mathematically: \(\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}\)

The central limit theorem describes the relationship between the sample distribution of sample means and the underlying population. This theorem is an important tool that allows data scientists and researchers to use sample data to generate inferences for population parameters.

Conditions of the central limit theorem:

  • If random samples are taken from any population with mean \(\mu\) and standard deviation \(\sigma\), where the sample size is at least 30, then the distribution of the sample means approximates a normal (bell-shaped) distribution.
  • If random samples are taken from a population that is normally distributed with mean \(\mu\) and standard deviation \(\sigma\), then the distribution of the sample means approximates a normal (bell-shaped) distribution for any sample size.
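The two properties of the sampling distribution can be checked with a short simulation. This is a minimal sketch under assumed inputs: the exponential population (mean 2, standard deviation 2), the sample size of 50, and the number of repetitions are all hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

# Hypothetical skewed population: exponential with mean 2 and standard deviation 2
mu, sigma = 2.0, 2.0
n = 50                # size of each sample (at least 30)
num_samples = 10_000  # number of repeated samples

# Draw many samples of size n and compute the mean of each one
samples = rng.exponential(scale=mu, size=(num_samples, n))
sample_means = samples.mean(axis=1)

mean_of_means = float(sample_means.mean())
sd_of_means = float(sample_means.std(ddof=1))

print("mean of sample means:", round(mean_of_means, 3), "| population mean:", mu)
print("sd of sample means:  ", round(sd_of_means, 3),
      "| sigma / sqrt(n):", round(sigma / np.sqrt(n), 3))
```

Even though the assumed population is strongly skewed, a histogram of `sample_means` would look roughly bell-shaped, which is the behavior the central limit theorem describes.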

Example 4.2

Problem

An economist takes random samples of size 50 to estimate the mean salary of chemical engineers. Assume that the population mean salary is $85,000 and the population standard deviation is $9,000. It is unknown if the distribution of salaries follows a normal distribution.

Calculate the mean and standard deviation of the sampling distribution of the sample means and comment on the shape of the distribution for the sample means.

Confidence Interval for the Mean When the Population Standard Deviation Is Known

Although in many situations the population standard deviation is unknown, in some cases a reasonable estimate for the population standard deviation can be obtained from past studies or historical data.

Here is what is needed to calculate the confidence interval for the mean when the population standard deviation is known:

  1. A random sample is selected from the population.
  2. The sample size is at least 30, or the underlying population is known to follow a normal distribution.
  3. The population standard deviation (\(\sigma\)) is known.

Once these conditions are met, the margin of error \(E\) is calculated according to the following formula:

\[ E = z_c \frac{\sigma}{\sqrt{n}} \]

Where:

\(z_c\) is called the critical value of the normal distribution. It is the z-score for which the area under the standard normal curve between \(-z_c\) and \(+z_c\) equals the confidence level. For example, for a 95% confidence interval, the corresponding critical value is \(z_c = 1.96\), since an area of 0.95 under the normal curve is contained between the z-scores of \(-1.96\) and \(+1.96\).

\(\sigma\) is the population standard deviation.

\(n\) is the sample size.

Note: \(\dfrac{\sigma}{\sqrt{n}}\) is called the standard error of the mean.

Typical values of \(z_c\) for various confidence levels are shown in Table 4.2.

| Confidence Level | Value of \(z_c\) |
|---|---|
| 80% confidence | 1.280 |
| 90% confidence | 1.645 |
| 95% confidence | 1.960 |
| 99% confidence | 2.575 |
Table 4.2 Typical Values of \(z_c\) for Various Confidence Levels

Graphically, the critical values can be marked on the normal distribution curve, as shown in Figure 4.2. (See Discrete and Continuous Probability Distributions for a review of the normal distribution curve.) Figure 4.2 shows an example for a 95% confidence interval, where the area of 0.95 is centered under the standard normal curve between \(-z_c\) and \(+z_c\). Since the standard normal curve is symmetric, and since the total area under the curve is known to be 1, the area in each of the two tails can be calculated to be 0.025. Thus, the critical value is the z-score that cuts off an area of 0.025 in the upper tail of the standard normal curve.

A standard normal curve for 95 percent confidence interval with area under the curve shaded blue. Critical values of negative z_c and z_c, labeled negative 1.96 and 1.96, respectively.
Figure 4.2 Critical Values of zc Marked on Standard Normal Curve for 95% Confidence Interval
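The tail-area reasoning can be reproduced in Python. The sketch below uses `norm.ppf()` from `scipy.stats`, which returns the z-score with a given area to its left; the helper name `z_critical` is an illustrative choice.

```python
from scipy.stats import norm

def z_critical(conf_level):
    """z-score that leaves (1 - conf_level) / 2 in the upper tail."""
    area_to_left = 1 - (1 - conf_level) / 2
    return float(norm.ppf(area_to_left))

for level in (0.80, 0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z_c = {z_critical(level):.3f}")
```

Software carries more decimal places than a printed table, so the values for 80% and 99% confidence come out as approximately 1.282 and 2.576, slightly different from the rounded table entries.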

Once the margin of error has been calculated, the confidence interval can be calculated as follows:

Sample Mean ± Margin of Error

Note that this is the general format for constructing any confidence interval: the margin of error is added to and subtracted from the sample statistic to generate the upper and lower bounds of the confidence interval, respectively. A sample statistic describes some aspect of the sample, such as a sample mean or sample proportion.

General Form of a Confidence Interval: Sample Statistic ± Margin of Error

Example 4.3

Problem

A college professor collects data on the amount of time spent on homework assignments per week for a sample of 50 students in a statistics course. From the sample of 50 students, the mean amount of time spent on homework assignments per week is 12.5 hours. The population standard deviation is known to be 6.3 hours.

The professor would like to forecast the amount of time spent on homework in future semesters. Create a forecasted confidence interval using both a 90% and 95% confidence interval and provide a conclusion regarding the confidence interval. Also compare the widths of the two confidence intervals. Which confidence interval is wider?
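One way to set up this calculation is sketched below, applying \(E = z_c \sigma / \sqrt{n}\) to the quantities in the problem statement. The function name `z_interval` is an illustrative choice, and the critical values come from `norm.ppf()` rather than from a table.

```python
from math import sqrt
from scipy.stats import norm

sample_mean = 12.5   # hours per week, from the problem statement
sigma = 6.3          # known population standard deviation (hours)
n = 50               # sample size

def z_interval(conf_level):
    """Confidence interval for the mean when sigma is known."""
    z_c = float(norm.ppf(1 - (1 - conf_level) / 2))  # critical value
    margin = z_c * sigma / sqrt(n)                   # E = z_c * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin

ci_90 = z_interval(0.90)
ci_95 = z_interval(0.95)
print("90% CI:", [round(b, 2) for b in ci_90])
print("95% CI:", [round(b, 2) for b in ci_95])
print("95% interval is wider:", (ci_95[1] - ci_95[0]) > (ci_90[1] - ci_90[0]))
```

The 95% interval is wider because a higher confidence level requires a larger critical value and therefore a larger margin of error.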

Confidence Interval for the Mean When the Population Standard Deviation Is Unknown

A confidence interval can still be determined when the population standard deviation is unknown by calculating a sample standard deviation based on the sample data. This is actually the more common application of confidence intervals.

Here are the conditions required to use this procedure:

  1. A random sample is selected from the population.
  2. The sample size is at least 30, or the underlying population is known to follow a normal distribution.
  3. The population standard deviation (\(\sigma\)) is unknown; the sample standard deviation (\(s\)) can be calculated.

Once these requirements are met, the margin of error (E) is calculated according to the following formula:

\[ E = t_c \frac{s}{\sqrt{n}} \]

\(t_c\) is called the critical value of the t-distribution. The t-distribution is a bell-shaped, symmetric distribution similar to the normal distribution, though the t-distribution has “thicker tails” as compared to the normal distribution. The comparison of the normal and t-distribution curves is shown in Figure 4.3.

Graph showing a normal distribution curve with a peak at zero, compared to two t-distributions with degrees of freedom v=30 and v=10, both also centered at zero.
Figure 4.3 Comparison of t-Distribution and Normal Distribution Curves

The t-distribution is actually a family of curves, determined by a parameter called degrees of freedom (df), where df is equal to \(n - 1\). The critical value \(t_c\) is similar to a z-score and specifies the area under the t-distribution curve corresponding to the confidence level between \(-t_c\) and \(+t_c\). Values of \(t_c\) can be obtained using either a look-up table or technology.

For example, for a 95% confidence interval and sample size of 30, the corresponding critical value is \(t_c = 2.045\), since an area of 0.95 under the t-distribution curve is contained between the t-scores of \(-2.045\) and \(+2.045\).

\(s\) is the sample standard deviation.

\(n\) is the sample size.

df is degrees of freedom, where \(df = n - 1\).

Typical values of \(t_c\) for various confidence levels and degrees of freedom are shown in Table 4.3.

| Degrees of Freedom (df) | 90% Confidence | 95% Confidence | 99% Confidence |
|---|---|---|---|
| 1 | 6.314 | 12.706 | 63.657 |
| 2 | 2.920 | 4.303 | 9.925 |
| 3 | 2.353 | 3.182 | 5.841 |
| 4 | 2.132 | 2.776 | 4.604 |
| 5 | 2.015 | 2.571 | 4.032 |
| 10 | 1.812 | 2.228 | 3.169 |
| 15 | 1.753 | 2.131 | 2.947 |
| 20 | 1.725 | 2.086 | 2.845 |
| 25 | 1.708 | 2.060 | 2.787 |
| 30 | 1.697 | 2.042 | 2.750 |
Table 4.3 Typical Values of \(t_c\) for Various Confidence Levels and Degrees of Freedom

Note that Python can be used to calculate these critical values. Python provides a function called t.ppf() that generates the value of the t-distribution corresponding to a specified area under the t-distribution curve and specified degrees of freedom. This function is part of the scipy.stats library.

The syntax for the function is:
t.ppf(area to left, degrees of freedom)

For example, for a 95% confidence interval, there is an area of 0.95 centered under the t-distribution curve, which leaves a remaining area of 0.05 for the two tails of the distribution, which implies an area of 0.025 in each tail. To find a critical value corresponding to the upper 0.025 area, note that the area to the left of this critical value will then be 0.975.

Here is an example of Python code to generate the critical value \(t_c\) for a 95% confidence interval and 15 degrees of freedom.

Python Code

    # import the t-distribution object from the scipy.stats library
    from scipy.stats import t

    # define the area to the left of the critical value and the degrees of freedom
    area_to_left = 0.975
    df = 15

    # use the t.ppf function to calculate the critical value of the t-distribution
    # use the round function to round the answer to 3 decimal places
    t_critical = round(t.ppf(area_to_left, df), 3)
    print(t_critical)

The resulting output will look like this:

2.131

Once the margin of error is calculated, the confidence interval is formed in the same way as the previous section, namely:

Lower Bound of the Confidence Interval = Point Estimate − Margin of Error
Upper Bound of the Confidence Interval = Point Estimate + Margin of Error

Example 4.4

Problem

A company’s human resource administrator wants to estimate the average commuting distance for all 5,000 employees at the company. Since it is impractical to collect commuting distances from all 5,000 employees, the administrator decides to sample 16 employees and collects data on commuting distance from each employee in the sample. The sample data indicates a sample mean of 15.8 miles with a standard deviation of 3.2 miles. Calculate a 99% confidence interval and provide a conclusion regarding the confidence interval.
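The pieces of this calculation can be laid out step by step in Python. This is a sketch that applies \(E = t_c s / \sqrt{n}\) with \(df = n - 1\) to the values in the problem statement; the variable names are illustrative.

```python
from math import sqrt
from scipy.stats import t

sample_mean = 15.8   # miles, from the problem statement
s = 3.2              # sample standard deviation (miles)
n = 16               # sample size
conf_level = 0.99

df = n - 1                                        # degrees of freedom
t_c = float(t.ppf(1 - (1 - conf_level) / 2, df))  # critical value of the t-distribution
margin = t_c * s / sqrt(n)                        # E = t_c * s / sqrt(n)

lower = sample_mean - margin
upper = sample_mean + margin
print(f"t_c = {t_c:.3f}, E = {margin:.2f}")
print(f"99% confidence interval: {lower:.2f} to {upper:.2f} miles")
```

Because the sample size is below 30, this procedure relies on the assumption that commuting distances are approximately normally distributed.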

Confidence Interval for Proportions

We can also calculate a confidence interval for a population proportion based on the use of sample data. The basis for the confidence interval will be the application of a normal approximation to the binomial distribution. Recall from Discrete and Continuous Probability Distributions that a binomial distribution is a probability distribution for a discrete random variable where there are only two possible outcomes of an experiment. A proportion measures the fraction of successes in a sample. For example, if 10 survey respondents out of 50 indicate they are planning to travel internationally within the next 12 months, then the proportion is 10 out of 50, which is 0.2, or 20%. Note that the term success does not necessarily imply a positive outcome. For example, a researcher might be interested in the proportion of smokers among U.S. adults, and the number of smokers would be considered the number of successes.

Some terminology will be helpful:
\(p\) represents the population proportion, which is typically unknown.
\(\hat{p}\) represents the sample proportion.
\(x\) represents the number of successes in the sample.
\(n\) represents the sample size.

Here are the requirements to use this procedure:

  1. A random sample is selected from the population.
  2. Verify that the normal approximation to the binomial distribution is appropriate by ensuring that \(n\hat{p}\) and \(n(1-\hat{p})\) are both at least 5, where \(\hat{p}\) represents the sample proportion.

The sample proportion \(\hat{p}\) is calculated as the number of successes divided by the sample size:

\[ \hat{p} = \frac{x}{n} \]

When these requirements are met, the margin of error (E) for a confidence interval for proportions is calculated according to the following formula:

\[ E = z_c \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]

Where:

\(z_c\) is called the critical value of the normal distribution. It is the z-score for which the area under the standard normal curve between \(-z_c\) and \(+z_c\) equals the confidence level.

\(\hat{p}\) represents the sample proportion.

\(n\) is the sample size.

Once the margin of error is calculated, the confidence interval is formed in the same way as the previous section. In this case, the point estimate is the sample proportion \(\hat{p}\):

Lower Bound of the Confidence Interval = Point Estimate − Margin of Error
Upper Bound of the Confidence Interval = Point Estimate + Margin of Error

Example 4.5

Problem

A medical researcher wants to know if there has been a statistically significant change in the proportion of smokers from five years ago, when the proportion of adult smokers in the United States was approximately 28%. The researcher selects a random sample of 1,500 adults, and of those, 360 respond that they are smokers. Calculate a 95% confidence interval for the true population proportion of adults who smoke. Also, determine if there has been a statistically significant change in the proportion of smokers as compared to five years ago.
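A possible setup for this problem in Python is sketched below. It checks the normal-approximation requirement, applies the margin of error formula for proportions, and then tests whether the earlier proportion of 0.28 falls inside the interval. The variable names are illustrative, and treating "0.28 lies outside the interval" as evidence of a change is the informal interpretation assumed here.

```python
from math import sqrt
from scipy.stats import norm

x = 360            # number of smokers in the sample
n = 1500           # sample size
conf_level = 0.95
p_five_years_ago = 0.28

p_hat = x / n  # sample proportion

# Requirement: n * p_hat and n * (1 - p_hat) are both at least 5
normal_approx_ok = n * p_hat >= 5 and n * (1 - p_hat) >= 5

z_c = float(norm.ppf(1 - (1 - conf_level) / 2))
margin = z_c * sqrt(p_hat * (1 - p_hat) / n)  # E = z_c * sqrt(p_hat * (1 - p_hat) / n)
lower, upper = p_hat - margin, p_hat + margin

print("normal approximation appropriate:", normal_approx_ok)
print(f"95% confidence interval: {lower:.3f} to {upper:.3f}")
print("0.28 inside interval:", lower <= p_five_years_ago <= upper)
```

If the earlier proportion lies outside the interval, the sample provides evidence that the proportion of smokers has changed.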

Sample Size Determination

When collecting sample data in order to construct a confidence interval for a mean or proportion, how does the researcher determine the optimal sample size? A sample size that is too small may lead to a wide confidence interval that is not very useful. A sample size that is too large can waste resources if a smaller sample would be sufficient. Sampling is covered in some depth in Handling Large Datasets. Here, we review methods to determine minimum sample size requirements when constructing confidence intervals for means or proportions. Note that the desired margin of error plays a key role in this sample size determination.

Sample Size for Confidence Interval for the Mean

When determining a confidence interval for a mean, a researcher can use the margin of error formula for the mean. Solving this formula algebraically for the sample size (\(n\)) gives the minimum sample size needed to achieve a certain margin of error:

Sample size formula for confidence interval for mean:

\[ E = z_c \frac{\sigma}{\sqrt{n}} \quad\Rightarrow\quad n = \left(\frac{z_c \sigma}{E}\right)^2 \]

Where:
\(z_c\) is the critical value of the normal distribution.
\(\sigma\) is the population standard deviation.
\(E\) is the desired margin of error.

Note that for sample size calculations, sample size results are rounded up to the next higher whole number. For example, if the formula previously shown results in a sample size of 59.2, then the sample size will be rounded to 60.

Example 4.6

Problem

A benefits analyst is interested in a 90% confidence interval for the mean salary for chemical engineers. What sample size should the analyst use if a margin of error of $1,000 is desired? Assume a population standard deviation of $8,000.
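The sample size formula, including the round-up rule, can be sketched as a small Python function. The name `sample_size_for_mean` is an illustrative choice, and the critical value is obtained from `norm.ppf()` rather than a table.

```python
from math import ceil
from scipy.stats import norm

def sample_size_for_mean(conf_level, sigma, margin_of_error):
    """Minimum n so that z_c * sigma / sqrt(n) does not exceed the margin of error."""
    z_c = float(norm.ppf(1 - (1 - conf_level) / 2))
    n_exact = (z_c * sigma / margin_of_error) ** 2
    return ceil(n_exact)  # always round up to the next whole number

# Values from the problem statement: 90% confidence, sigma = $8,000, E = $1,000
n_required = sample_size_for_mean(0.90, sigma=8_000, margin_of_error=1_000)
print("minimum sample size:", n_required)
```

Rounding up rather than to the nearest integer guarantees that the achieved margin of error is no larger than the one requested.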

Sample Size for Confidence Interval for a Proportion

In Example 4.5, how would the researcher know the optimal sample size to use to collect sample data on the proportion of adults who are smokers? The margin of error formula can be used to derive the sample size for a proportion as follows:

\[ n = \hat{p}(1-\hat{p})\left(\frac{z_c}{E}\right)^2 \]

Where:
\(z_c\) is the critical value of the normal distribution.
\(\hat{p}\) is the sample proportion.
\(E\) is the desired margin of error.

Notice in this formula, it’s assumed that some prior estimate for the sample proportion \(\hat{p}\) is available, perhaps from historical data or previous studies. If a prior estimate for the sample proportion is not available, then use the value of 0.5 for \(\hat{p}\).

Determining Sample Size

When determining sample size needed for a confidence interval for a proportion:

  • If a prior estimate for the sample proportion is available, then that prior estimate should be utilized.
  • If a prior estimate for the sample proportion is not available, then use \(\hat{p} = 0.5\).

Example 4.7

Problem

Political candidate Smith is planning a survey to determine a 95% confidence interval for the proportion of voters who plan to vote for Smith. How many people should be surveyed? Assume the desired margin of error is 3%.

  1. Assume there is no prior estimate for the proportion of voters who plan to vote for candidate Smith.
  2. Assume that from prior election results, approximately 42% of people previously voted for candidate Smith.
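Both parts of this problem use the same formula with a different value for the prior estimate. A minimal sketch follows; the function name `sample_size_for_proportion` and its default of 0.5 for a missing prior estimate are illustrative choices based on the rule stated earlier.

```python
from math import ceil
from scipy.stats import norm

def sample_size_for_proportion(conf_level, margin_of_error, p_estimate=0.5):
    """Minimum n for a proportion; p_estimate defaults to 0.5 when no prior exists."""
    z_c = float(norm.ppf(1 - (1 - conf_level) / 2))
    n_exact = p_estimate * (1 - p_estimate) * (z_c / margin_of_error) ** 2
    return ceil(n_exact)  # round up to the next whole number

n_no_prior = sample_size_for_proportion(0.95, 0.03)                     # part 1
n_with_prior = sample_size_for_proportion(0.95, 0.03, p_estimate=0.42)  # part 2
print("no prior estimate:   ", n_no_prior)
print("prior estimate of 42%:", n_with_prior)
```

Using 0.5 gives the largest possible value of \(\hat{p}(1-\hat{p})\), so it yields the most conservative (largest) sample size.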

Bootstrapping Methods

In the earlier discussion of confidence intervals, certain requirements needed to be met in order to use the estimation methods. For example, when calculating the confidence interval for a mean, this requirement was indicated:

  • The sample size is at least 30, or the underlying population is known to follow a normal distribution.

When calculating the confidence interval for a proportion, this requirement was noted:

  • Verify that the normal approximation to the binomial distribution is appropriate by ensuring that both \(n\hat{p}\) and \(n(1-\hat{p})\) are at least 5.

What should a researcher do when these requirements are not met? Fortunately, there is another option called “bootstrapping” that can be used to find confidence intervals when the underlying distribution is unknown or if one of the conditions is not met. This bootstrapping method involves repeatedly taking samples with replacement.

Sampling with replacement means that when an item is selected for the sample, it is returned to the dataset before the next item is selected. For example, a casino dealer plans to take a sample of two cards from a standard 52-card deck. If the sampling is to be done with replacement, then after the first card is selected, it is returned to the deck before the second card is selected. This implies that the same data value can appear multiple times in a given sample, since once a data value is selected, it is returned to the dataset and is eligible to be selected again.
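The difference between the two sampling schemes can be seen with Python's standard `random` module. In this sketch, `random.choices()` draws with replacement and `random.sample()` draws without replacement; the deck construction and the seed are illustrative.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = [f"{rank} of {suit}" for suit in suits for rank in ranks]

# With replacement: each card goes back in the deck, so repeats are possible
two_with_replacement = random.choices(deck, k=2)

# Without replacement: a drawn card cannot be drawn again
two_without_replacement = random.sample(deck, k=2)

# With replacement, a draw can even be larger than the deck itself
sixty_cards = random.choices(deck, k=60)

print(two_with_replacement)
print(two_without_replacement)
print("distinct cards among 60 draws:", len(set(sixty_cards)))
```

A draw of 60 cards from a 52-card deck must contain repeats, which is only possible because each card is returned before the next draw.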

For example, a researcher who has collected a sample of 100 observations to use for inference may want to estimate a confidence interval for the population mean. The researcher can then resample, with replacement, from this limited set of 100 observations. They would repeatedly draw a resample of 100 observations and calculate its sample mean. These means would then represent the distribution of means (remember the discussion of sampling distribution). Typically, for bootstrapping, repeated samples are taken hundreds or thousands of times and then the sample mean is calculated for each sample.

The term “bootstrapping” comes from the old saying “pull yourself up by your bootstraps” to imply a task that was accomplished without any outside help. In the statistical sense, bootstrapping refers to the ability to estimate parameters based solely on one sample of data from the population without any other assumptions.

The bootstrapping method is considered a nonparametric method since the technique makes no assumptions about the probability distribution from which the data are sampled. (Parametric methods, by contrast, assume a specific form for the underlying distribution and require estimating parameters.)

Since bootstrapping requires a large number of repeated samples, software (such as Excel, Python, or R) is often used to automate the repetitive sampling procedures and construct the confidence intervals.

The bootstrapping procedure for confidence interval estimation for a mean or proportion is as follows:

  1. Start out with a random sample of size \(n\). Collect many random bootstrap samples of size \(n\) (for example, hundreds or thousands of such samples). Keep in mind that the sampling is done with replacement.
  2. For each sample, calculate the sample statistic, which is the sample mean \(\bar{x}\) or the sample proportion \(\hat{p}\).
  3. Rank order the sample statistics from smallest to largest.
  4. For a 95% confidence interval, find the percentiles \(P_{2.5}\) and \(P_{97.5}\) in the ranked data. These values establish the 95% confidence interval.
  5. For a 90% confidence interval, find the percentiles \(P_{5}\) and \(P_{95}\). These values establish the 90% confidence interval.

Example 4.8

Problem

A college administrator is developing new marketing materials to increase enrollment at the college, and the administrator is interested in a 90% confidence interval for the mean age of students attending the college.

The administrator believes the underlying distribution of ages is skewed (does not follow a normal distribution), so a bootstrapping method will be used to construct the confidence interval. The administrator selects a random sample of 20 students and records their ages as shown in Table 4.4. Use the bootstrapping method to construct the 90% confidence interval by taking repeated samples of size 10.

Student ID Student Age
001 22
002 24
003 21
004 34
005 29
006 23
007 21
008 20
009 19
010 21
011 25
012 28
013 22
014 37
015 24
016 31
017 23
018 19
019 26
020 20
Table 4.4 Ages of 20 Randomly Selected Students from One College
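The bootstrapping procedure can be sketched with NumPy using the ages in Table 4.4. The function name `bootstrap_ci_mean`, the number of resamples (5,000), and the seed are assumptions made for illustration. The resample size is left as a parameter: the problem statement asks for samples of size 10, while passing `len(ages)` would match the general procedure of resampling at the original sample size.

```python
import numpy as np

# Student ages from Table 4.4
ages = np.array([22, 24, 21, 34, 29, 23, 21, 20, 19, 21,
                 25, 28, 22, 37, 24, 31, 23, 19, 26, 20])

def bootstrap_ci_mean(data, conf_level, resample_size, num_resamples=5_000, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = np.random.default_rng(seed)
    # Resample with replacement and compute the mean of each bootstrap sample
    resamples = rng.choice(data, size=(num_resamples, resample_size), replace=True)
    means = resamples.mean(axis=1)
    # np.percentile ranks the means internally and returns the cutoff values
    tail = (1 - conf_level) / 2 * 100
    lower, upper = np.percentile(means, [tail, 100 - tail])
    return float(lower), float(upper)

lower, upper = bootstrap_ci_mean(ages, conf_level=0.90, resample_size=10)
print(f"sample mean age: {ages.mean():.2f}")
print(f"90% bootstrap confidence interval: {lower:.2f} to {upper:.2f}")
```

Because the resamples are random, the exact bounds depend on the seed and the number of resamples; a larger number of resamples generally gives more stable percentiles.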

Exploring Further

Simulating Confidence Intervals

A number of websites and resources, such as the online textbook Online Statistics Education: A Multimedia Course of Study (http://onlinestatbook.com/) (project leader: David M. Lane, Rice University), allow the user to simulate confidence intervals for means or proportions using simulated sample data. It is useful to observe how many intervals contain an assumed value for the population mean or population proportion. The Stapplet website provides a similar tool.
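A similar simulation can be sketched directly in Python. The population values below (mean 100, standard deviation 15), the sample size, and the number of intervals are hypothetical choices; the sketch builds a 95% confidence interval from each simulated sample and counts how many contain the assumed population mean.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=42)  # fixed seed so the run is repeatable

mu, sigma = 100.0, 15.0   # assumed (hypothetical) population mean and standard deviation
n = 50                    # size of each simulated sample
num_intervals = 2_000     # number of confidence intervals to simulate
conf_level = 0.95

z_c = float(norm.ppf(1 - (1 - conf_level) / 2))
margin = z_c * sigma / np.sqrt(n)   # same margin for every interval (sigma known)

# One sample mean per simulated sample
sample_means = rng.normal(mu, sigma, size=(num_intervals, n)).mean(axis=1)

# An interval "covers" mu when mu lies between its lower and upper bounds
covered = (sample_means - margin <= mu) & (mu <= sample_means + margin)
coverage = float(covered.mean())
print(f"{coverage:.1%} of the {num_intervals} intervals contain the population mean")
```

The observed fraction should be close to, but not exactly, 95%, which illustrates what the confidence level means over repeated sampling.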

Using Python to Calculate Confidence Intervals

When calculating confidence intervals, data scientists typically make use of technology to help streamline and automate the analysis. Python libraries provide functions for confidence interval calculations, and several examples are shown in Table 4.5. Students are encouraged to try the examples themselves and experiment with using Python to assist with these calculations.

Table 4.5 provides a summary of functions for confidence interval calculations. The norm.interval() and t.interval() functions are available within the SciPy library (scipy.stats), while proportion_confint() is part of the statsmodels library (statsmodels.stats.proportion):

| Usage | Python Function Name | Syntax |
|---|---|---|
| Calculate confidence interval for the mean when population standard deviation is known, given sample mean, population standard deviation, and sample size (uses normal distribution). | norm.interval() | norm.interval(conf_level, sample_mean, standard_dev/sqrt(n)) |
| Calculate confidence interval for the mean when population standard deviation is unknown, given sample mean, sample standard deviation, and sample size (uses t-distribution). | t.interval() | t.interval(conf_level, degrees_freedom, sample_mean, standard_dev/sqrt(n)) |
| Calculate confidence interval for a proportion (uses normal distribution). | proportion_confint() | proportion_confint(success, sample_size, 1 - confidence_level) |
Table 4.5 Python Functions for Confidence Intervals

Example 4.9

Problem

Repeat Example 4.3 to calculate a 90% confidence interval, but use Python functions to calculate the confidence interval.

Recall from Example 4.3, the sample mean is 12.5 hours with a population standard deviation of 6.3 hours. The sample size is 50 students, and we are interested in calculating a 90% confidence interval.
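One way to carry out this calculation is with `norm.interval()` from `scipy.stats`, passing the confidence level, the sample mean as the center (`loc`), and the standard error of the mean as the spread (`scale`). The variable names below are illustrative.

```python
from math import sqrt
from scipy.stats import norm

sample_mean = 12.5   # hours
sigma = 6.3          # known population standard deviation (hours)
n = 50               # sample size

standard_error = sigma / sqrt(n)
lower, upper = norm.interval(0.90, loc=sample_mean, scale=standard_error)
lower, upper = float(lower), float(upper)

print(f"90% confidence interval: {lower:.2f} to {upper:.2f} hours")
```

The function returns both bounds at once, so the critical value and the margin of error do not need to be computed separately.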

Example 4.10

Problem

Repeat Example 4.4 to calculate a 99% confidence interval, but use Python functions to calculate the confidence interval.

Recall from Example 4.4, the sample mean is 15.8 miles with a standard deviation of 3.2 miles. The sample size is 16 employees, and we are interested in calculating a 99% confidence interval.
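A sketch of this calculation with `t.interval()` from `scipy.stats` follows. It uses the sample size of 16 employees stated in Example 4.4, which gives 15 degrees of freedom; the variable names are illustrative.

```python
from math import sqrt
from scipy.stats import t

sample_mean = 15.8   # miles
s = 3.2              # sample standard deviation (miles)
n = 16               # sample size from Example 4.4

degrees_freedom = n - 1
standard_error = s / sqrt(n)
lower, upper = t.interval(0.99, degrees_freedom, loc=sample_mean, scale=standard_error)
lower, upper = float(lower), float(upper)

print(f"99% confidence interval: {lower:.2f} to {upper:.2f} miles")
```

Compared with `norm.interval()`, the only extra argument is the degrees of freedom, which selects the appropriate t-distribution curve.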

Citation/Attribution

This book may not be used in the training of large language models or otherwise be ingested into large language models or generative AI offerings without OpenStax's permission.

Want to cite, share, or modify this book? This book uses the Creative Commons Attribution-NonCommercial-ShareAlike License and you must attribute OpenStax.

Attribution information
  • If you are redistributing all or part of this book in a print format, then you must include on every physical page the following attribution:
    Access for free at https://openstax.org/books/principles-data-science/pages/1-introduction
  • If you are redistributing all or part of this book in a digital format, then you must include on every digital page view the following attribution:
    Access for free at https://openstax.org/books/principles-data-science/pages/1-introduction
Citation information

© Dec 19, 2024 OpenStax. Textbook content produced by OpenStax is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License . The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, and OpenStax CNX logo are not subject to the Creative Commons license and may not be reproduced without the prior and express written consent of Rice University.