R is a statistical analysis tool widely used in the data science field. It is a free, open-source program that provides an integrated suite of functions for data exploration, visualization, graphing, and statistical programming. R is increasingly popular among data scientists in part because it is open source and the user community constantly contributes new features and packages. R runs on many different computing platforms and can be downloaded from the R Project website.
Once you have installed and started R on your computer, at the bottom of the R console you should see the symbol >, which indicates that R is ready to accept commands.
For a user new to R, typing help.start() at the R prompt provides a menu of Manuals and Reference materials, as shown in Figure B1.
R provides many built-in help resources. When a user types help() at the R prompt, a listing of help resources is provided. For a specific example, typing help(median) will show documentation on the built-in median function within R.
In addition, if a user types demo() at the R prompt, various demonstration options are shown. For a specific example, typing demo(graphics) will provide some examples of various graphics plots.
Basic Data Analysis Using R
R is a command-driven language, meaning that the user enters commands at the prompt, which R then executes one at a time. R can also execute a program (script) containing multiple commands, as sketched below. There are ways to add a graphical user interface (GUI) to R; an example of a GUI tool for R is RStudio.
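For example, a series of commands can be saved in a plain-text script file and run all at once with the built-in source() function. This is only a sketch; the file name my_analysis.R and the data values are hypothetical.

# contents of a script file, e.g., my_analysis.R
x <- c(12, 15, 9, 22)   # create a small vector of data values
print(mean(x))          # print the mean when the script is run

# run the entire script from the R prompt
source("my_analysis.R")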
The R command line can be used to perform any numeric calculation, similar to a handheld calculator. For example, to evaluate the expression 10 + 3 × 7, enter the following expression at the command line prompt and press Return. The numeric result of 31 is then shown:
> 10+3*7
[1] 31
Most calculations in R are handled via functions. For data science and statistical analysis, there are many pre-established functions in R to calculate mean, median, standard deviation, quartiles, and so on. Variables can be named and assigned values using the assignment operator <-.
For example, the following R commands assign the value of 20 to the variable named x and assign the value of 30 to the variable named y:
> x <- 20
> y <- 30
These variable names can be used in any calculation, such as multiplying x by y to produce the result 600:
> x*y
[1] 600
The typical method for using functions in statistical applications is to first create a vector of data values. There are several ways to create vectors in R. For example, the c function is often used to combine values into a vector. The following R command will generate a vector called salaries that contains the data values 40000, 50000, 75000, and 92000:
> salaries <- c(40000, 50000, 75000, 92000)
This vector salaries can then be used in statistical functions such as mean, median, min, max, and so on, as shown:
> mean(salaries)
[1] 64250
> median(salaries)
[1] 62500
> min(salaries)
[1] 40000
> max(salaries)
[1] 92000
Another option for generating a vector in R is to use the seq function, which will automatically generate a sequence of numbers. For example, we can generate a sequence of numbers from 1 to 5, incremented by 0.5, and call this vector example1, as follows:
> example1 <- seq(1, 5, by=0.5)
If we then type the name of the vector and press Enter, R provides a listing of the numeric values stored in that vector:
> salaries
[1] 40000 50000 75000 92000
> example1
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
Oftentimes, a data scientist is interested in generating a quick statistical summary of a dataset in the form of its mean, median, quartiles, min, and max. The R command called summary provides these results.
> summary(salaries)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  40000   47500   62500   64250   79250   92000
For measures of spread, R includes a command for standard deviation, sd(), and a command for variance, var(). The standard deviation and variance are calculated under the assumption that the dataset was collected from a sample.
> sd(salaries)
[1] 23641.42
> var(salaries)
[1] 558916667
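As a brief sketch of what sd() and var() compute, the sample variance can also be obtained directly from its formula, which divides the sum of squared deviations by n - 1:

n <- length(salaries)
sum((salaries - mean(salaries))^2) / (n - 1)         # same result as var(salaries)
sqrt(sum((salaries - mean(salaries))^2) / (n - 1))   # same result as sd(salaries)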
To calculate a weighted mean in R, create two vectors: one containing the data values and the other containing the associated weights. Then enter the R command weighted.mean(values, weights).
The following is an example of a weighted mean calculation in R:
EXAMPLE B.1
Problem
Assume a financial portfolio contains 1,000 shares of XYZ Corporation, purchased on three different dates, as shown in Table B1. Use R to calculate the weighted mean of the purchase price, where the weights are based on the number of shares in the portfolio.
Date Purchased | Purchase Price ($) | Number of Shares Purchased |
---|---|---|
January 17 | 78 | 200 |
February 10 | 122 | 300 |
March 23 | 131 | 500 |
Total | | 1000 |
Solution
Create two vectors in R: the price vector contains the purchase prices, and the shares vector contains the number of shares. Then execute the R command weighted.mean(price, shares), as follows:
> price <- c(78, 122, 131)
> shares <- c(200, 300, 500)
> weighted.mean(price, shares)
[1] 117.7
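As a quick check, the weighted mean can also be computed directly from its definition, the sum of each price multiplied by its weight divided by the total weight:

> sum(price*shares)/sum(shares)
[1] 117.7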
A list of common R statistical commands appears in Table B2.
R Command | Result |
---|---|
mean() | Calculates the arithmetic mean |
median() | Calculates the median |
min() | Calculates the minimum value |
max() | Calculates the maximum value |
weighted.mean() | Calculates the weighted mean |
sum() | Calculates the sum of values |
summary() | Calculates the mean, median, quartiles, min, and max |
sd() | Calculates the sample standard deviation |
var() | Calculates the sample variance |
IQR() | Calculates the interquartile range |
barplot() | Plots a bar chart of non-numeric data |
boxplot() | Plots a boxplot of numeric data |
hist() | Plots a histogram of numeric data |
plot() | Plots various graphs, including a scatter plot |
freq() | Creates a frequency distribution table |
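Not all of the commands in Table B2 are demonstrated above. As a brief sketch, sum() and IQR() can be applied to the salaries vector created earlier; the values noted in the comments follow from the summary output shown previously.

sum(salaries)   # total of the four salaries: 257000
IQR(salaries)   # third quartile minus first quartile: 79250 - 47500 = 31750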
Basic Visualization and Graphing Using R
R provides many built-in functions for data visualization and graphing and allows the data scientist significant flexibility and customization options for graphs and other data visualizations.
There are many statistical applications in R, and many graphical representations are possible, such as bar graphs, histograms, time series plots, scatter plots, and others.
As a simple example of a bar graph, assume a college instructor wants to create a bar graph to show enrollment in various courses: statistics, history, physics, and chemistry.
Table B3 shows the enrollment data:
College Course | Student Enrollment |
---|---|
Statistics | 375 |
History | 302 |
Physics | 294 |
Chemistry | 193 |
The basic command to create a bar graph in R is barplot().
First, create a dataframe called enrollment to hold the data (a dataframe can be considered a table or matrix to store the dataset).
enrollment <- data.frame(course=c("Statistics", "History", "Physics", "Chemistry"),
enrolled=c(375, 302, 294, 193))
Next, use the barplot function to create the bar graph and add labels for the x-axis, y-axis, and overall title.
barplot(enrollment$enrolled, names.arg=enrollment$course,
main="Student Enrollment at the College", xlab="Course Name",
ylab = "Enrollment")
The resulting bar graph is shown in Figure B2.
The basic command to create a scatter plot in R is the plot command, plot(x, y), where x is a vector containing the x-values of the dataset and y is a vector containing the y-values of the dataset.
The general format of the command is as follows:
>plot(x, y, main="text for title of graph",
xlab="text for x-axis label", ylab="text for y-axis label")
For example, suppose we are interested in creating a scatter plot to examine the correlation between the value of the S&P 500 and the Nike stock price. Assume we have the data shown in Table B4, collected over a one-year time period.
Date | S&P 500 | Nike Stock Price ($) |
---|---|---|
4/1/2020 | 2912.43 | 87.18 |
5/1/2020 | 3044.31 | 98.58 |
6/1/2020 | 3100.29 | 98.05 |
7/1/2020 | 3271.12 | 97.61 |
8/1/2020 | 3500.31 | 111.89 |
9/1/2020 | 3363.00 | 125.54 |
10/1/2020 | 3269.96 | 120.08 |
11/1/2020 | 3621.63 | 134.70 |
12/1/2020 | 3756.07 | 141.47 |
1/1/2021 | 3714.24 | 133.59 |
2/1/2021 | 3811.15 | 134.78 |
3/1/2021 | 3943.34 | 140.45 |
3/12/2021 | 3943.34 | 140.45 |
(source: https://finance.yahoo.com/)
Note that data can be read into R from a text file or Excel file or from the clipboard by using various R commands. Assume the values of the S&P 500 have been loaded into the vector SP500 and the values of Nike stock prices have been loaded into the vector Nike. Then, to generate the scatter plot, we can use the following R command:
>plot(SP500, Nike, main="Scatter Plot of Nike Stock Price vs. S&P 500",
xlab="S&P 500", ylab="Nike Stock Price")
As a result of these commands, R provides the scatter plot shown in Figure B3.
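For completeness, here is one sketch of how the two vectors could be prepared: either by entering the Table B4 values with c() or by reading them from a file with read.csv(). The file name stock_prices.csv and its column names are hypothetical.

# Option 1: enter the values from Table B4 directly
SP500 <- c(2912.43, 3044.31, 3100.29, 3271.12, 3500.31, 3363.00, 3269.96,
           3621.63, 3756.07, 3714.24, 3811.15, 3943.34, 3943.34)
Nike <- c(87.18, 98.58, 98.05, 97.61, 111.89, 125.54, 120.08,
          134.70, 141.47, 133.59, 134.78, 140.45, 140.45)

# Option 2: read the data from a CSV file with columns named SP500 and Nike
# prices <- read.csv("stock_prices.csv")
# SP500 <- prices$SP500
# Nike <- prices$Nike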
R Analysis for Probability Distributions
As mentioned earlier, a quick statistical summary can be generated using the summary() command in R, where the usage is summary(data_vector).
R has extensive built-in libraries for calculating confidence intervals, hypothesis testing, and working with various probability distributions such as binomial, Poisson, normal, chi-square, and other probability distributions.
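As a brief sketch of these capabilities, the built-in t.test() function performs a one-sample t-test and reports a confidence interval for the mean; applied to the salaries vector created earlier, it returns a 95% confidence interval by default.

t.test(salaries)                     # hypothesis test plus a 95% confidence interval for the mean
t.test(salaries, conf.level = 0.90)  # the confidence level can be changed with conf.level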
As discussed in Measures of Center, data scientists are often interested in various probability distributions such as the normal distribution, binomial distribution, and Poisson distribution. R provides built-in functions to analyze many probability distributions.
R uses the pnorm command to find the area under the normal curve to the left of a specified value:
Usage: pnorm(x_value, mean, standard_deviation)
where pnorm returns the probability that a random variable having the given mean and standard deviation is less than x_value.
EXAMPLE B.2
Problem
Birth weights for newborns in the United States are normally distributed with a mean of 3,400 grams and a standard deviation of 500 grams.
- Use R to find the probability that a random newborn infant weighs less than 4,000 grams.
- Use R to find the probability that a random newborn infant weighs more than 3,000 grams.
Solution
For part (a), use the pnorm command as follows:
pnorm(4000, 3400, 500)
which returns the probability result of:
[1] 0.8849303
For part (b), the option lower.tail=FALSE on the pnorm command calculates the area to the right of a given x-value:
pnorm(3000, 3400, 500, lower.tail=FALSE)
which returns the probability result of:
[1] 0.7881446
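Equivalently, because the total area under the normal curve is 1, the area to the right of 3,000 can be found by subtracting the left-tail probability from 1; this sketch returns the same result as using lower.tail=FALSE:

1 - pnorm(3000, 3400, 500)   # area to the right of 3000, which equals 0.7881446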
R also provides built-in functions for the binomial distribution as follows:
Binomial distribution (cumulative probability):
Usage: pbinom(k, n, p)
where n is the number of trials, p is the probability of success on each trial, and k is the number of successes; pbinom returns the cumulative probability of observing k or fewer successes.
EXAMPLE B.3
Problem
A data scientist conducts a survey for a sample of 20 people and asks the survey question: “Did you find the website for ABC corporation easy to navigate?” From past data, the probability that a random person found the website easy to navigate was 65%. Use R to find the probability that at most 13 of the 20 respond that they find the website easy to navigate.
Solution
Use the pbinom command as follows:
pbinom(13, 20, 0.65)
which returns the probability result of:
[1] 0.5833746
R also provides a built-in function for the probability of an exact number of successes in a binomial distribution:
Binomial distribution (exact probability):
Usage: dbinom(k, n, p)
where n is the number of trials, p is the probability of success on each trial, and k is the number of successes; dbinom returns the probability of observing exactly k successes.
EXAMPLE B.4
Problem
A data scientist conducts a survey for a sample of 20 people and asks the survey question: “Did you find the website for ABC corporation easy to navigate?” From past data, the probability that a random person found the website easy to navigate was 65%. Use R to find the probability that exactly 13 of the 20 respond that they find the website easy to navigate.
Solution
Use the dbinom command as follows:
dbinom(13, 20, 0.65)
which returns the probability result of:
[1] 0.1844012
R also provides a built-in function for the Poisson distribution as follows:
Poisson distribution:
Usage: dpois(k, mu)
where mu is the mean of the Poisson distribution and k is the number of occurrences for which the probability is desired; dpois returns the probability of exactly k occurrences.
EXAMPLE B.5
Problem
A traffic engineer investigates a certain intersection that has an average of 3 accidents per month. Use R to find the probability of 5 accidents in a given month.
Solution
Use the dpois command as follows:
dpois(5, 3)
which returns the probability result of:
[1] 0.1008188
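Note that dpois() gives the probability of exactly k occurrences. For the cumulative probability of k or fewer occurrences, R provides the related ppois() command; a brief sketch for this example:

dpois(5, 3)   # probability of exactly 5 accidents in a month
ppois(5, 3)   # probability of at most 5 accidents in a month (approximately 0.916)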
Basic Correlation and Regression Analysis Using R
Recall from Inferential Statistics and Regression Analysis the discussion of correlation and regression analysis. A first step in correlation analysis is to calculate the correlation coefficient (r) for the (x, y) data. R provides a built-in function, cor(), to calculate the correlation coefficient for bivariate data.
As an example, consider the dataset in Table B5 that tracks the return on the S&P 500 versus return on Coca-Cola stock for a seven-month time period.
Month | S&P 500 Monthly Return (%) | Coca-Cola Monthly Return (%) |
---|---|---|
Jan | 8 | 6 |
Feb | 1 | 0 |
Mar | 0 | -2 |
Apr | 2 | 1 |
May | -3 | -1 |
Jun | 7 | 8 |
Jul | 4 | 2 |
To calculate the correlation coefficient for this dataset, first create two vectors in R, one vector for the S&P 500 returns and a second vector for Coca-Cola returns:
> SP500 <- c(8, 1, 0, 2, -3, 7, 4)
> CocaCola <- c(6, 0, -2, 1, -1, 8, 2)
The R command called cor returns the correlation coefficient for the x-data vector and y-data vector:
> cor(SP500, CocaCola)
[1] 0.9123872
Thus the correlation coefficient for this dataset is approximately 0.912.
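Before fitting a linear model, it is good practice to check whether the correlation is statistically significant. One way to sketch this check is with the built-in cor.test() function, which reports a p-value for the null hypothesis that the true correlation is zero (for this dataset, that p-value matches the regression p-value of about 0.004 shown later):

cor.test(SP500, CocaCola)   # test of significance for the correlation coefficient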
Linear Regression Models Using R
To create a linear model in R, assuming the correlation is significant, the command lm() (for linear model) will provide the slope and y-intercept for the linear regression equation.
The format of the R command is
lm(dependent_variable_vector ~ independent_variable_vector)
Notice the use of the tilde symbol (~) as the separator between the dependent variable vector and the independent variable vector.
We use the returns on Coca-Cola stock as the dependent variable and the returns on the S&P 500 as the independent variable, and thus the R command would be
> lm(CocaCola ~ SP500)

Call:
lm(formula = CocaCola ~ SP500)

Coefficients:
(Intercept)        SP500
    -0.3453       0.8641
The R output provides the value of the y-intercept as -0.3453 and the value of the slope as 0.8641. Based on this, the linear model would be

$\hat{y} = -0.3453 + 0.8641x$

where x represents the value of the S&P 500 return and y represents the value of the Coca-Cola stock return.
The results can also be saved as a formula, called model here, using the following R command. To obtain more detailed results for the linear regression, the summary command can be used, as follows:
> model <- lm(CocaCola ~ SP500)
> summary(model)

Call:
lm(formula = CocaCola ~ SP500)

Residuals:
      1       2       3       4       5       6       7
-0.5672 -0.5188 -1.6547 -0.3828  1.9375  2.2969 -1.1109

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.3453     0.7836  -0.441  0.67783
SP500         0.8641     0.1734   4.984  0.00416 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.658 on 5 degrees of freedom
Multiple R-squared: 0.8325,    Adjusted R-squared: 0.7989
F-statistic: 24.84 on 1 and 5 DF,  p-value: 0.004161
In this output, the y-intercept and slope are given, as well as the residuals for each x-value. The output includes additional statistical details regarding the regression analysis.
Predicted values and prediction intervals can also be generated within R. First, we can create a structure in R called a dataframe to hold the values of the independent variable for which we want to generate a prediction. For example, we would like to generate the predicted return for Coca-Cola stock, given that the return for the S&P 500 is 6.
We use the R command called predict().
To generate a prediction for the linear regression equation called model, using the dataframe where the value of the S&P 500 is 6, the R commands will be
> a <- data.frame(SP500=6)
> predict(model, a)
       1
4.839062
The output from the predict command indicates that the predicted return for Coca-Cola stock will be 4.8% when the return for the S&P 500 is 6%.
We can extend this analysis to generate a 95% prediction interval for this result by using the following R command, which adds an option to the predict command to generate a prediction interval:
> predict(model, a, interval="predict")
       fit        lwr     upr
1 4.839062 0.05417466 9.62395
Thus the 95% prediction interval for Coca-Cola return is (0.05%, 9.62%) when the return for the S&P 500 is 6%.
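Similarly, a 95% confidence interval for the mean Coca-Cola return (rather than a prediction interval for an individual month) can be sketched by changing the interval option:

predict(model, a, interval="confidence")   # confidence interval for the mean return when SP500 = 6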
Multiple Regression Models Using R
R also includes many tools to allow the data scientist to conduct multiple regression, where a dependent variable is predicted based on more than one independent variable. For example, we might arrive at a better prediction model for the monthly return of Coca-Cola stock if we consider not only the S&P 500 monthly return but also the monthly sales of Coca-Cola products.
Here are several examples where a multiple regression model might provide an improved prediction model as compared to a regression model with only one independent variable:
- Employee salaries can be predicted based on years of experience and education level.
- Housing prices can be predicted based on square footage of a home, number of bedrooms, and number of bathrooms.
The general form of the multiple regression model is

$y = a + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n$

where:

- $x_1, x_2, \ldots, x_n$ are the independent variables,
- $b_1, b_2, \ldots, b_n$ are the coefficients, where each coefficient is the amount of change in y when the corresponding independent variable is changed by one unit and all other independent variables are held constant,
- a is the y-intercept, which is the value of y when all $x_i = 0$.
Recall from an earlier example that the format for linear regression analysis in R when there is only one independent variable looked like the following:
> model <- lm(y ~ x)
where y is the dependent variable and x is the independent variable.
To “add in” additional independent variables for the multiple regression approach, we use a format as follows:
> model <- lm(y ~ x1 + x2 + x3)
where $x_1$, $x_2$, and $x_3$ are the independent variables.
Example:
Use R to create a multiple regression model to predict the price of a home based on the independent variables square footage and number of bedrooms, using the following dataset. Then use the multiple regression model to predict the price of a home with 3,000 square feet and 3 bedrooms (see Table B6).
Price of Home | Square Footage | Number of Bedrooms |
---|---|---|
466000 | 4668 | 6 |
355000 | 3196 | 5 |
405900 | 3998 | 5 |
415000 | 4022 | 5 |
206000 | 1834 | 2 |
462000 | 4668 | 6 |
290000 | 2650 | 3 |
First, create three vectors in R, one vector each for home price, square footage and number of bedrooms:
> price <- c(466000, 355000, 405900, 415000, 206000, 462000, 290000)
> square_footage <- c(4668, 3196, 3998, 4022, 1834, 4668, 2650)
> bedrooms <- c(6, 5, 5, 5, 2, 6, 3)
Next, run the multiple regression model using the lm command:
> model <- lm(price ~ square_footage + bedrooms)
> summary(model)

Call:
lm(formula = price ~ square_footage + bedrooms)

Residuals:
      1       2       3       4       5       6       7
-1704.6  1438.4  -606.9  6908.7 -6747.4 -5704.6  6416.5

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)    57739.887   9529.693   6.059  0.00375 **
square_footage    66.017      8.924   7.398  0.00178 **
bedrooms       16966.536   6283.183   2.700  0.05408 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6563 on 4 degrees of freedom
Multiple R-squared: 0.9968,    Adjusted R-squared: 0.9953
F-statistic: 630.6 on 2 and 4 DF,  p-value: 9.996e-06
In the R output, note that the column labeled "Estimate" provides the estimates for the coefficients and the y-intercept.
The y-intercept is given as 57740 (rounding to nearest whole number).
The coefficient for the “square footage” variable is given as 66.
The coefficient for the "bedrooms" variable is given as 16967.
Based on these values, the multiple regression model is

$\hat{y} = 57740 + 66 x_1 + 16967 x_2$

where $x_1$ is the square footage and $x_2$ is the number of bedrooms. We can now use the multiple regression model to predict the price of a home with 3,000 square feet and 3 bedrooms by setting $x_1 = 3000$ and $x_2 = 3$, as follows:

$\hat{y} = 57740 + 66(3000) + 16967(3) = 57740 + 198000 + 50901 = 306641$

Thus, the predicted price of a home with 3,000 square feet and 3 bedrooms is $306,641.
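The same prediction can also be sketched with the predict() command introduced earlier. Because predict() uses the unrounded coefficients, its result will differ slightly from the hand calculation above.

new_home <- data.frame(square_footage=3000, bedrooms=3)
predict(model, new_home)   # predicted price for a 3,000-square-foot, 3-bedroom home (close to $306,641)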