Principles of Data Science

4.3 Correlation and Linear Regression Analysis


Learning Outcomes

By the end of this section, you should be able to:

  • 4.3.1 Create scatterplots and calculate and interpret correlation coefficients.
  • 4.3.2 Perform linear regression and determine the best-fit linear equation.
  • 4.3.3 Use Python to calculate correlation coefficients and determine equations of linear regression models.

We briefly introduced correlation analysis at the beginning of this chapter, but now we want to dig in a little deeper. Data scientists are often interested in knowing if there are relationships, or a correlation, between two numeric quantities. For example, a medical researcher might be interested in investigating if there is a relationship between age and blood pressure. A college professor might be interested to know if there is a relationship between time spent on social media by students and corresponding academic grades.

Correlation analysis allows for the determination of a statistical relationship between two numeric quantities, or variables—an independent variable and a dependent variable. The independent variable is the variable that you can change or control, while the dependent variable is what you measure or observe to see how it responds to those changes. Regression analysis takes this a step further by quantifying the relationship between the two variables, and it can be used to predict one quantity based on a second quantity, assuming there is a significant correlation between the two quantities. For example, the value of a car can be predicted based on the age of the car.

Earlier we noted many business-related applications where correlation and regression analysis are used. For instance, regression analysis can be used to establish a mathematical equation that relates a dependent variable (such as sales) to an independent variable (such as advertising expenditure). But it has many applications apart from business as well. An automotive engineer is interested in the correlation between outside temperature and battery life for an electric vehicle, for instance.

Our discussion here will focus on linear regression—analyzing the relationship between one dependent variable and one independent variable, where the relationship can be modeled using a linear equation.

Scatterplots and Correlation

In correlation analysis, we study the relationship between bivariate data, which is data collected on two variables where the data values are paired with one another. Correlation measures the association between two numeric variables. We may be interested in knowing if there is a correlation between bond prices and interest rates or between the age of a car and the value of the car. To investigate the correlation between two numeric quantities, the first step is to collect (x, y) data for the two numeric quantities of interest and then create a scatterplot that will graph the (x, y) ordered pairs. The independent, or explanatory, quantity is labeled the x-variable, and the dependent, or response, quantity is labeled the y-variable.

For example, let’s say that a financial analyst wants to know if the price of Nike stock is correlated with the value of the S&P 500 (Standard & Poor’s 500 stock market index). To investigate this, monthly data can be collected for Nike stock prices and the value of the S&P 500 for a period of time, and a scatterplot can be created and examined. A scatterplot, or scatter diagram, is a graphical display intended to show the relationship between two variables. The setup of the scatterplot is that one variable is plotted on the horizontal axis and the other variable is plotted on the vertical axis. Each pair of data values is considered as an (x, y) point, and the various points are plotted on the diagram. A visual inspection of the plot is then made to detect any patterns or trends on the scatter diagram. Table 4.14 shows Nike stock prices and the corresponding values of the S&P 500 on a monthly basis over a one-year time period.

Date S&P 500 (x) Nike Stock Price (y)
Month 1 2912.43 87.18
Month 2 3044.31 98.58
Month 3 3100.29 98.05
Month 4 3271.12 97.61
Month 5 3500.31 111.89
Month 6 3363.00 125.54
Month 7 3269.96 120.08
Month 8 3621.63 134.70
Month 9 3756.07 141.47
Month 10 3714.24 133.59
Month 11 3811.15 134.78
Month 12 3943.34 140.45
Table 4.14 Nike Stock Price ($) and Value of S&P 500 over a One-Year Time Period
(source: Yahoo! Finance)
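As a hedged illustration (the backend choice, labels, and output filename below are my own, not from the text), a scatterplot of the Table 4.14 data, like the one shown in Figure 4.5, can be generated with matplotlib:

```python
# Sketch: recreate a scatterplot of the Table 4.14 data with matplotlib.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt

# x-data: monthly S&P 500 values; y-data: monthly Nike stock prices
sp500 = [2912.43, 3044.31, 3100.29, 3271.12, 3500.31, 3363.00,
         3269.96, 3621.63, 3756.07, 3714.24, 3811.15, 3943.34]
nike = [87.18, 98.58, 98.05, 97.61, 111.89, 125.54,
        120.08, 134.70, 141.47, 133.59, 134.78, 140.45]

plt.scatter(sp500, nike)             # one point per (x, y) data pair
plt.xlabel("S&P 500")                # independent (explanatory) variable
plt.ylabel("Nike Stock Price ($)")   # dependent (response) variable
plt.title("Nike Stock Price vs. Value of S&P 500")
plt.savefig("nike_scatterplot.png")  # hypothetical output filename
```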

To assess linear correlation, examine the graphical trend of the data points on the scatterplot to determine if a straight-line pattern exists (see Figure 4.5). If a linear pattern exists, the correlation may indicate either a positive or a negative correlation. A positive correlation indicates that as the independent variable increases, the dependent variable tends to increase as well, or as the independent variable decreases, the dependent variable tends to decrease (the two quantities move in the same direction). A negative correlation indicates that as the independent variable increases, the dependent variable decreases, or as the independent variable decreases, the dependent variable increases (the two quantities move in opposite directions). If there is no relationship or association between the two quantities, where one quantity changing does not affect the other quantity, we conclude that there is no correlation between the two variables.

A scatter plot for Nike stock against S&P 500. The diagram shows that the Nike stock price rises from approximately 85 to 140 as the S&P 500 rises from approximately 2900 to 4000. The data points generally align along an upwardly sloping line.
Figure 4.5 Scatterplot of Nike Stock Price ($) and Value of S&P 500
(source: Yahoo! Finance)

From the scatterplot in the Nike stock versus S&P 500 example, we note that the trend reflects a positive correlation in that as the value of the S&P 500 increases, the price of Nike stock tends to increase as well.

When inspecting a scatterplot, it may be difficult to assess a correlation based on a visual inspection of the graph alone. A more precise assessment of the correlation between the two quantities can be obtained by calculating the numeric correlation coefficient (referred to using the symbol r).

The correlation coefficient is a measure of the strength and direction of the correlation between the independent variable x and the dependent variable y.

The formula for rr is shown; however, software is typically used to calculate the correlation coefficient.

r = (nΣxy − (Σx)(Σy)) / (√(nΣx² − (Σx)²) · √(nΣy² − (Σy)²))

where n refers to the number of data pairs and the symbol Σx indicates to sum the x-values.

Note that this value of “r” is sometimes referred to as the “Pearson correlation coefficient.”

Table 4.15 provides a step-by-step procedure on how to calculate the correlation coefficient r.

Step Representation in Symbols
1. Calculate the sum of the x-values. Σx
2. Calculate the sum of the y-values. Σy
3. Multiply each x-value by the corresponding y-value and calculate the sum of these xy products. Σxy
4. Square each x-value and then calculate the sum of these squared values. Σx²
5. Square each y-value and then calculate the sum of these squared values. Σy²
6. Determine the value of n, which is the number of data pairs. n
7. Use these results to then substitute into the formula for the correlation coefficient. r = (nΣxy − (Σx)(Σy)) / (√(nΣx² − (Σx)²) · √(nΣy² − (Σy)²))
Table 4.15 Steps for Calculating the Correlation Coefficient
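As an illustrative check (not part of the original text), the seven steps above can be carried out directly in Python on the Table 4.14 data and compared with scipy's pearsonr():

```python
# Sketch: compute r "by hand" following the steps in Table 4.15,
# then verify against scipy.stats.pearsonr.
import numpy as np
from scipy.stats import pearsonr

x = np.array([2912.43, 3044.31, 3100.29, 3271.12, 3500.31, 3363.00,
              3269.96, 3621.63, 3756.07, 3714.24, 3811.15, 3943.34])
y = np.array([87.18, 98.58, 98.05, 97.61, 111.89, 125.54,
              120.08, 134.70, 141.47, 133.59, 134.78, 140.45])

sum_x = x.sum()           # Step 1: sum of x-values
sum_y = y.sum()           # Step 2: sum of y-values
sum_xy = (x * y).sum()    # Step 3: sum of the xy products
sum_x2 = (x ** 2).sum()   # Step 4: sum of squared x-values
sum_y2 = (y ** 2).sum()   # Step 5: sum of squared y-values
n = len(x)                # Step 6: number of data pairs

# Step 7: substitute into the formula for r
r_manual = (n * sum_xy - sum_x * sum_y) / (
    np.sqrt(n * sum_x2 - sum_x ** 2) * np.sqrt(n * sum_y2 - sum_y ** 2))

r_scipy, _ = pearsonr(x, y)
print(round(float(r_manual), 3))  # 0.923, matching pearsonr
```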

Note that since r is calculated using sample data, r is considered a sample statistic and is used to measure the strength of the correlation for the two population variables. Recall that sample data indicates data based on a subset of the entire population.

As mentioned, given the complexity of this calculation, software is typically used to calculate the correlation coefficient. There are several options for calculating the correlation coefficient in Python. The example shown uses the scipy.stats library, which includes the function pearsonr() for calculating the Pearson correlation coefficient (r).

We will create two numerical arrays in Python using the np.array() function and then use the pearsonr() function to calculate the Pearson correlation coefficient for the (x, y) data shown in Table 4.14.

The following Python code provides the correlation coefficient for the Nike Stock Price dataset as 0.923.

Python Code

        
        # Import libraries
        import numpy as np
        from scipy.stats import pearsonr

        # Establish x and y data: x-data is S&P 500 value, y-data is Nike stock price
        SP500 = np.array([2912.43, 3044.31, 3100.29, 3271.12, 3500.31, 3363.00,
                          3269.96, 3621.63, 3756.07, 3714.24, 3811.15, 3943.34])
        Nike = np.array([87.18, 98.58, 98.05, 97.61, 111.89, 125.54,
                         120.08, 134.70, 141.47, 133.59, 134.78, 140.45])

        # pearsonr returns both the value of r and the corresponding p-value
        r, p = pearsonr(SP500, Nike)

        # Print the value of r, rounded to 3 decimal places
        print("Correlation coefficient:", round(r, 3))
        

The resulting output will look like this:

Correlation coefficient: 0.923

Interpret a Correlation Coefficient

Once the value of r is calculated, this measurement provides two indicators for the correlation:

  1. the strength of the correlation based on the value of r
  2. the direction of the correlation based on the sign of r

The value of r gives us this information:

  • The value of r is always between −1 and +1: −1 ≤ r ≤ +1.
  • The size of the correlation r indicates the strength of the linear relationship between the two variables. Values of r close to −1 or to +1 indicate a stronger linear relationship.
  • If r = 0, there is no linear relationship between the two variables (no linear correlation).
  • If r = 1, there is perfect positive correlation. If r = −1, there is perfect negative correlation. In both of these cases, all the original data points lie on a straight line.

The sign of r gives us this information:

  • A positive value of r means that when x increases, y tends to increase, and when x decreases, y tends to decrease (positive correlation).
  • A negative value of r means that when x increases, y tends to decrease, and when x decreases, y tends to increase (negative correlation).

From the statistical results shown in the Python output, the correlation coefficient r is 0.923, which indicates that the relationship between Nike stock and the value of the S&P 500 over this time period represents a strong, positive correlation.

Test a Correlation Coefficient for Significance

The correlation coefficient, r, tells us about the strength and direction of the linear relationship between x and y. The sample data are used to compute r, the correlation coefficient for the sample. If we had data for the entire population (that is, all measurements of interest), we could find the population correlation coefficient, which is labeled as the Greek letter ρ (pronounced “rho”). But because we have only sample data, we cannot calculate the population correlation coefficient. The sample correlation coefficient, r, is our estimate of the unknown population correlation coefficient.

  • ρ = population correlation coefficient (unknown)
  • r = sample correlation coefficient (known; calculated from sample data)

An important step in the correlation analysis is to determine if the correlation is significant. By this, we are asking if the correlation is strong enough to allow meaningful predictions for y based on values of x. One method to test the significance of the correlation is to employ a hypothesis test. The hypothesis test lets us decide whether the value of the population correlation coefficient ρ is close to zero or significantly different from zero. We decide this based on the sample correlation coefficient r and the sample size n.

If the test concludes that the correlation coefficient is significantly different from zero, we say that the correlation coefficient is significant.

  • Conclusion: There is sufficient evidence to conclude that there is a significant linear relationship between the two variables because the correlation coefficient is significantly different from zero.
  • What the conclusion means: There is a significant linear relationship between the two variables.

If the test concludes that the correlation coefficient is not significantly different from zero (it is close to zero), we say that the correlation coefficient is not significant.

A hypothesis test can be performed to test if the correlation is significant. A hypothesis test is a statistical method that uses sample data to test a claim regarding the value of a population parameter. In this case, the hypothesis test will be used to test the claim that the population correlation coefficient ρ is equal to zero.

Use these hypotheses when performing the hypothesis test:

  • Null hypothesis: H0: ρ = 0
  • Alternate hypothesis: Ha: ρ ≠ 0

The hypotheses can be stated in words as follows:

  • Null hypothesis H0: The population correlation coefficient is not significantly different from zero. There is not a significant linear relationship (correlation) between x and y in the population.
  • Alternate hypothesis Ha: The population correlation coefficient is significantly different from zero. There is a significant linear relationship (correlation) between x and y in the population.

A quick shorthand way to test correlations is the relationship between the sample size and the correlation. If |r| ≥ 2/√n, then the correlation between the two variables demonstrates that a linear relationship exists and is statistically significant at approximately the 0.05 level of significance. As the formula indicates, there is an inverse relationship between the sample size and the required correlation for significance of a linear relationship. With only 10 observations, the required correlation for significance is 0.6325; for 30 observations, the required correlation for significance decreases to 0.3651; and at 100 observations, the required level is only 0.2000.
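These threshold values follow directly from the 2/√n formula; a quick sketch confirming them:

```python
# Required |r| for significance at roughly the 0.05 level, per the 2/sqrt(n) rule
from math import sqrt

thresholds = {n: round(2 / sqrt(n), 4) for n in (10, 30, 100)}
print(thresholds)  # {10: 0.6325, 30: 0.3651, 100: 0.2}
```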

Interpreting the Significance of the Sample Correlation Coefficient

  • If r is significant and the scatterplot shows a linear trend, the line can be used to predict the value of y for values of x that are within the domain of observed x-values.
  • If r is not significant OR if the scatterplot does not show a linear trend, the line should not be used for prediction.
  • If r is significant and the scatterplot shows a linear trend, the line may not be appropriate or reliable for prediction outside the domain of observed x-values in the data.

Example 4.22

Problem

Suppose that the chief financial officer (CFO) of a corporation is investigating the correlation between stock prices and unemployment rate over a period of 10 years and finds the correlation coefficient to be −0.68. There are 10 (x, y) data points in the dataset. Should the CFO conclude that the correlation is significant for the relationship between stock prices and unemployment rate based on a level of significance of 0.05?
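The text presents this as a problem to work. One hedged way to sketch a solution, applying the 2/√n rule of thumb from the previous subsection (an approximation, not a formal hypothesis test), is:

```python
# Sketch: significance check for the CFO's correlation via the 2/sqrt(n) rule
from math import sqrt

r = -0.68   # sample correlation coefficient
n = 10      # number of (x, y) data pairs

threshold = 2 / sqrt(n)           # about 0.6325 for n = 10
significant = abs(r) >= threshold
print(significant)  # True: 0.68 exceeds 0.6325, so the correlation is significant
```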

Linear Regression

A regression model is typically developed once the correlation coefficient has been calculated and a determination has been made that the correlation is significant. The regression model can then be used to predict one quantity (the dependent variable) based on the other (the independent variable). For example, a business may want to establish a correlation between the amount the company spent on advertising (the independent variable) and its recorded sales (the dependent variable). If a strong enough correlation is established, then the business manager can predict sales based on the amount spent on advertising for a given period. In this discussion we will focus on linear regression, where a straight line is used to model the relationship between the two variables. Once a straight-line model is developed, this model can then be used to predict the value of the dependent variable for a specific value of the independent variable.

Method of Least Squares and Residuals

To create a linear model that best fits the (x, y) points on a scatterplot, researchers generally use the least squares method, based on a mathematical algorithm designed to minimize the squared distances from the (x, y) data points to the fitted line.

Recall from algebra that the equation of a straight line is given by:

y = mx + b

where m is the slope of the line and b is the y-intercept of the line.

The slope measures the steepness of the line, and the y-intercept is that point on the y-axis where the graph crosses, or intercepts, the y-axis.

In linear regression analysis, the equation of the straight line is written in a slightly different way using the model

ŷ = a + bx

In this format, b is the slope of the line, and a is the y-intercept of the line. The notation ŷ is called y-hat and is used to indicate a predicted value of the dependent variable y for a certain value of the independent variable x.

If a line extends uphill from left to right, the slope is a positive value; if the line extends downhill from left to right, the slope is a negative value (see Figure 4.6).

Three separate boxes show line graphs labeled a, b, and c, respectively. Graph a shows a line sloping upward to the right. Graph b shows a horizontal line. Graph c shows a line sloping downward to the right.
Figure 4.6 Three examples of graphs of ŷ = a + bx. (a) If b > 0, the line slopes upward to the right. (b) If b = 0, the line is horizontal. (c) If b < 0, the line slopes downward to the right.

When generating the equation of a line in algebra using y = mx + b, two (x, y) points were required to generate the equation. However, in regression analysis, all the (x, y) points in the dataset will be utilized to develop the linear regression model.

The first step in any regression analysis is to create the scatterplot. Then proceed to calculate the correlation coefficient r and check this value for significance. If we think that the points show a linear relationship, we would like to draw a line on the scatterplot. This line can be calculated through a process called linear regression. However, we only calculate a regression line if one of the variables helps to explain or predict the other variable. If x is the independent variable and y the dependent variable, then we can use a regression line to predict y for a given value of x.

As an example of a regression equation, assume that a correlation exists between the monthly amount spent on advertising and the monthly revenue for a Fortune 500 company. After collecting (x, y) data for a certain time period, the company determines the regression equation is of the form

ŷ = 9376.7 + 61.8x

where x represents the monthly amount spent on advertising (in thousands of dollars), ŷ represents the monthly revenue for the company (in thousands of dollars), the slope is 61.8, and the y-intercept is 9376.7.

A scatterplot of the (x, y) data is shown in Figure 4.7.

A scatter plot of revenue against ad spend. The diagram shows revenue increasing from approximately 12 million dollars to 19 million dollars as advertising spend increases from approximately 50,000 dollars to 150,000 dollars.
Figure 4.7 Scatterplot of Revenue versus Advertising for a Fortune 500 Company ($000s)

The company would like to predict the monthly revenue if its executives decide to spend $150,000 on advertising next month. To determine the estimate of monthly revenue, let x = 150 in the regression equation and calculate the corresponding value for ŷ:

ŷ = 9376.7 + 61.8x
ŷ = 9376.7 + 61.8(150)
ŷ = 18,646.7

This predicted value of y indicates that the anticipated revenue would be $18,646,700, given the advertising spend of $150,000.
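This substitution is simple to reproduce in Python (a minimal sketch using the coefficients quoted above):

```python
# Sketch: predict monthly revenue (in $000s) for x = 150 ($150,000 ad spend)
a = 9376.7   # y-intercept from the regression model
b = 61.8     # slope from the regression model

y_hat = a + b * 150
print(round(y_hat, 1))  # 18646.7, i.e., predicted revenue of $18,646,700
```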

Notice that from past data, there may have been a month where the company actually did spend $150,000 on advertising, and thus the company may have an actual result for the monthly revenue. This actual, or observed, amount can be compared to the prediction from the linear regression model to calculate what is called a residual.

A residual is the difference between an observed y-value and the predicted y-value obtained from the linear regression equation. As an example, assume that in a previous month, the actual monthly revenue for an advertising spend of $150,000 was $19,200,000, and thus y = 19,200. The residual for this data point can be calculated as follows:

Residual = (observed y-value) − (predicted y-value)
Residual = y − ŷ
Residual = 19,200 − 18,646.7 = 553.3

Notice that residuals can be positive, negative, or zero. If the observed y-value exactly matches the predicted y-value, then the residual will be zero. If the observed y-value is greater than the predicted y-value, then the residual will be a positive value. If the observed y-value is less than the predicted y-value, then the residual will be a negative value.
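The residual computation above can be sketched in one short Python snippet (values taken from the example):

```python
# Sketch: residual = observed y-value minus predicted y-value
observed = 19200     # actual monthly revenue ($000s) at $150,000 ad spend
predicted = 18646.7  # prediction from the regression model

residual = observed - predicted
print(round(residual, 1))  # 553.3 (positive: the observed value lies above the line)
```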

When formulating the best-fit linear regression line to the points on the scatterplot, the mathematical analysis generates a linear equation where the sum of the squared residuals is minimized. This analysis is referred to as the method of least squares. The result is that the analysis generates a linear equation that is the “best fit” to the points on the scatterplot, in the sense that the line minimizes the differences between the predicted values and observed values for yy; this is generally referred to as the best-fit linear equation.

Example 4.23

Problem

Suppose that the chief financial officer of a corporation has created a linear model for the relationship between the company stock and interest rates. When interest rates are at 5%, the company stock has a value of $94. Using the linear model, when interest rates are at 5%, the model predicts the value of the company stock to be $99. Calculate the residual for this data point.
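The text leaves the solution as an exercise; a sketch applying the residual definition from above:

```python
# Sketch: residual for the company stock data point
observed = 94    # actual stock value ($) when interest rates are 5%
predicted = 99   # value predicted by the linear model

residual = observed - predicted
print(residual)  # -5: negative, since the observed value is below the prediction
```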

The goal in regression analysis is to determine the coefficients a and b in the following regression equation:

ŷ = a + bx

Once the (x, y) data have been collected, the slope (b) and y-intercept (a) can be calculated using the following formulas:

b = (nΣxy − (Σx)(Σy)) / (nΣx² − (Σx)²)

a = (Σy)/n − b(Σx)/n

where n refers to the number of data pairs and Σx indicates the sum of the x-values.

Notice that the formula for the y-intercept requires the use of the slope result (b), and thus the slope should be calculated first, and the y-intercept should be calculated second.
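As an illustrative check (not from the text), these formulas can be applied in Python to the advertising data in Table 4.16; computing the slope first and then the intercept reproduces the model quoted earlier, ŷ = 9376.7 + 61.8x:

```python
# Sketch: slope-then-intercept computation using the least-squares formulas
import numpy as np

x = np.array([49, 145, 57, 153, 92, 83, 117, 142, 69, 106, 109, 121])   # ad spend
y = np.array([12210, 17590, 13215, 19200, 14600, 14100, 17100, 18400,
              14100, 15500, 16300, 17020])                               # revenue

n = len(x)
b = (n * (x * y).sum() - x.sum() * y.sum()) / (n * (x ** 2).sum() - x.sum() ** 2)
a = y.sum() / n - b * x.sum() / n   # the intercept formula uses b, so compute b first

print(round(float(b), 1), round(float(a), 1))  # 61.8 9376.7
```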

When making predictions for y, it is always important to plot a scatter diagram first. If the scatterplot indicates that there is a linear relationship between the variables, then it is reasonable to use a best-fit line to make predictions for y, given that x is within the domain of x-values in the sample data, but not necessarily for x-values outside that domain.

Using Technology for Linear Regression

Typically, technology is used to calculate the best-fit linear model, as well as to calculate correlation coefficients and generate scatterplots. Details of using Python for these calculations are provided in Using Python for Correlation and Linear Regression.

Assumptions for Linear Regression

Testing the significance of the correlation coefficient requires that certain assumptions about the data are satisfied. The premise of this test is that the data are a sample of observed points taken from a larger population. We have not examined the entire population because it is not possible or feasible to do so. We are examining the sample to draw a conclusion about whether the linear relationship that we see between x and y data in the sample data provides strong enough evidence that we can conclude that there actually is a linear relationship between x and y data in the population.

The regression line equation that we calculate from the sample data gives the best-fit line for our particular sample. We want to use this best-fit line for the sample as an estimate of the best-fit line for the population (Figure 4.8). Examining the scatterplot and testing the significance of the correlation coefficient helps us determine if it is appropriate to do this.

These are the assumptions underlying the test of significance:

  1. There is a linear relationship in the population that models the average value of y for varying values of x. In other words, the expected value of y for each particular value of x lies on a straight line in the population. (We do not know the equation for the line for the population. Our regression line from the sample is our best estimate of this line in the population.)
  2. The y-values for any particular x-value are normally distributed about the line. This implies that there are more y-values scattered closer to the line than are scattered farther away. Assumption (1) implies that these normal distributions are centered on the line: the means of these normal distributions of y-values lie on the line.
  3. The standard deviations of the population y-values about the line are equal for each value of x. In other words, each of these normal distributions of y-values has the same shape and spread about the line.
  4. The residual errors are mutually independent (no pattern).
  5. The data are produced from a well-designed random sample or randomized experiment.
Two diagrams of a best fit line. The first diagram labeled a shows a linearly descending line running through the center of three vertical sets of scattered points. The second diagram labeled b shows a linearly descending line running through the mean of three tilted bell curves. The bottom of each bell curve aligns with the position of the three vertical scattered points in diagram a.
Figure 4.8 Best-Fit Line. Note that the y-values for each x-value are normally distributed about the line with the same standard deviation. For each x-value, the mean of the y-values lies on the regression line. More y-values lie near the line than are scattered farther away from the line.

Using Python for Correlation and Linear Regression

Once a correlation has been deemed significant, a linear regression model is developed. The goal in the regression analysis is to determine the coefficients a and b in the following regression equation:

ŷ = a + bx

These formulas can be quite cumbersome, especially for a large number of data pairs, and thus software is often used (such as Excel, Python, or R).

Develop a Linear Regression Model

In the following example, Python will be used to perform this analysis on a given dataset of (x, y) data.

  • Create a scatterplot.
  • Calculate the correlation coefficient.
  • Construct the best-fit linear equation.
  • Predict values of the dependent variable for given values of the independent variable.

There are various ways to accomplish these tasks in Python, but we will use several functions available within the scipy and matplotlib libraries, such as:

pearsonr() for the correlation coefficient

linregress() for the linear regression model

plt.scatter() (from matplotlib.pyplot) for scatterplot generation

Example 4.24

Problem

A marketing manager is interested in studying the correlation between the amount spent on advertising and revenue at a certain company. Twelve months of data were collected as shown in Table 4.16. (Dollar amounts are in thousands of dollars.)

Use Python to perform the following analysis:

  1. Generate a scatterplot.
  2. Calculate the correlation coefficient r.
  3. Determine if the correlation is significant (use level of significance of 0.05).
  4. Construct a linear regression model.
  5. Use the model to predict the revenue for a certain month where the amount spent on advertising is $100.
Month Advertising Spend Revenue
Jan 49 12210
Feb 145 17590
Mar 57 13215
Apr 153 19200
May 92 14600
Jun 83 14100
Jul 117 17100
Aug 142 18400
Sep 69 14100
Oct 106 15500
Nov 109 16300
Dec 121 17020
Table 4.16 Revenue versus Advertising for a Company (in $000s)
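A sketch of how steps 2 through 5 of this problem might be carried out with scipy's linregress() (variable names are my own; the predicted value for x = 100 is computed from the fitted model):

```python
# Sketch: correlation, significance check, regression model, and prediction
import numpy as np
from scipy.stats import linregress

spend = np.array([49, 145, 57, 153, 92, 83, 117, 142, 69, 106, 109, 121])
revenue = np.array([12210, 17590, 13215, 19200, 14600, 14100, 17100, 18400,
                    14100, 15500, 16300, 17020])

result = linregress(spend, revenue)

print("r =", round(float(result.rvalue), 3))
# Significance via the 2/sqrt(n) rule of thumb (approx. 0.05 level)
print("significant:", abs(result.rvalue) >= 2 / np.sqrt(len(spend)))

print("model: y-hat =", round(float(result.intercept), 1),
      "+", round(float(result.slope), 1), "x")

# Predicted revenue ($000s) when advertising spend is $100 (thousand)
y_hat = result.intercept + result.slope * 100
print("predicted revenue:", round(float(y_hat), 1))
```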

Interpret and Apply the Slope and y-Intercept

The slope of the line, b, describes how changes in the variables are related. It is important to interpret the slope of the line in the context of the situation represented by the data. You should be able to write a sentence interpreting the slope in plain English.

Interpretation of the Slope

The slope of the best-fit line tells us how the dependent variable (y) changes for every one-unit increase in the independent (x) variable, on average.

In the previous example, the linear regression model for the monthly amount spent on advertising and the monthly revenue for a company for 12 months was generated as follows:

ŷ = a + bx
ŷ = 9376.7 + 61.8x

Since the slope was determined to be 61.8, the company can interpret this to mean that for every $1,000 spent on advertising, on average, this will result in an increase in revenues of $61,800.

The intercept of the regression equation is the corresponding y-value when x = 0.

Interpretation of the Intercept

The intercept of the best-fit line tells us the expected mean value of y in the case where the x-variable is equal to zero.

However, in many scenarios it may not make sense to have the x-variable equal zero, and in these cases, the intercept does not have any meaning in the context of the problem. In other examples, the x-value of zero is outside the range of the x-data that was collected. In this case, we should not assign any interpretation to the y-intercept.

In the previous example, the range of data collected for the x-variable was from $49,000 to $153,000 spent per month on advertising. Since this interval does not include an x-value of zero, we would not provide any interpretation for the intercept.

Citation/Attribution

This book uses the Creative Commons Attribution-NonCommercial-ShareAlike License, and you must attribute OpenStax. Access for free at https://openstax.org/books/principles-data-science/pages/1-introduction

© Dec 19, 2024 OpenStax. Textbook content produced by OpenStax is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, and OpenStax CNX logo are not subject to the Creative Commons license and may not be reproduced without the prior and express written consent of Rice University.