R is a statistical analysis tool widely used in the data science field. It is a free, open-source program that provides an integrated suite of functions for data exploration, visualization, graphing, and statistical programming. R is increasingly popular among data scientists in part because it is open source and the user community constantly contributes new features and packages. R runs on many different computing platforms and can be downloaded from the R Project website.
Once you have installed and started R on your computer, at the bottom of the R console you should see the symbol >, which indicates that R is ready to accept commands.
For a user new to R, typing help.start() at the R prompt provides a menu of Manuals and Reference materials, as shown in Figure B1.
R provides many built-in help resources. When a user types help() at the R prompt, a listing of help resources is provided. For a specific example, typing help(median) will show documentation on the built-in median function within R.
In addition, if a user types demo() at the R prompt, various demonstration options are shown. For a specific example, typing demo(graphics) will provide some examples of various graphics plots.
Basic Data Analysis Using R
R is a command-driven language, meaning that the user enters commands at the prompt, which R then executes one at a time. R can also execute a program (script) containing multiple commands, as sketched below. There are ways to add a graphical user interface (GUI) to R; an example of a GUI tool for R is RStudio.
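For example, a series of commands can be saved in a plain-text script file and run all at once with the built-in source() function. This is only a sketch; the file name my_analysis.R and the data values are hypothetical.

# contents of a script file, e.g., my_analysis.R
x <- c(12, 15, 9, 22)   # create a small vector of data values
print(mean(x))          # print the mean when the script is run

# run the entire script from the R prompt
source("my_analysis.R")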
The R command line can be used to perform any numeric calculation, similar to a handheld calculator. For example, to evaluate the expression 10 + 3 × 7, enter the following expression at the command line prompt and press Return. The numeric result of 31 is then shown:
> 10+3*7
[1] 31
Most calculations in R are handled via functions. For data science and statistical analysis, there are many pre-established functions in R to calculate mean, median, standard deviation, quartiles, and so on. Variables can be named and assigned values using the assignment operator <-.
For example, the following R commands assign the value of 20 to the variable named x and assign the value of 30 to the variable named y:
> x <- 20
> y <- 30
These variable names can be used in any calculation, such as multiplying x by y to produce the result 600:
> x*y
[1] 600
The typical method for using functions in statistical applications is to first create a vector of data values. There are several ways to create vectors in R. For example, the c function is often used to combine values into a vector. The following R command will generate a vector called salaries that contains the data values 40000, 50000, 75000, and 92000:
> salaries <- c(40000, 50000, 75000, 92000)
This vector salaries can then be used in statistical functions such as mean, median, min, max, and so on, as shown:
> mean(salaries)
[1] 64250
> median(salaries)
[1] 62500
> min(salaries)
[1] 40000
> max(salaries)
[1] 92000
Another option for generating a vector in R is to use the seq function, which will automatically generate a sequence of numbers. For example, we can generate a sequence of numbers from 1 to 5, incremented by 0.5, and call this vector example1, as follows:
> example1 <- seq(1, 5, by=0.5)
If we then type the name of the vector and press Enter, R provides a listing of the numeric values stored in that vector:
> salaries
[1] 40000 50000 75000 92000
> example1
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
Oftentimes, a data scientist is interested in generating a quick statistical summary of a dataset in the form of its mean, median, quartiles, min, and max. The R command called summary provides these results.
> summary(salaries)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  40000   47500   62500   64250   79250   92000
For measures of spread, R includes a command for standard deviation, sd(), and a command for variance, var(). The standard deviation and variance are calculated under the assumption that the dataset was collected from a sample.
> sd(salaries)
[1] 23641.42
> var(salaries)
[1] 558916667
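As a brief sketch of what sd() and var() compute, the sample variance can also be obtained directly from its formula, which divides the sum of squared deviations by n - 1:

n <- length(salaries)
sum((salaries - mean(salaries))^2) / (n - 1)         # same result as var(salaries)
sqrt(sum((salaries - mean(salaries))^2) / (n - 1))   # same result as sd(salaries)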
To calculate a weighted mean in R, create two vectors: one containing the data values and the other containing the associated weights. Then enter the R command weighted.mean(values, weights).
The following is an example of a weighted mean calculation in R:
EXAMPLE B.1
Problem
Assume a financial portfolio contains 1,000 shares of XYZ Corporation, purchased on three different dates, as shown in Table B1. Use R to calculate the weighted mean of the purchase price, where the weights are based on the number of shares in the portfolio.
Date Purchased | Purchase Price ($) | Number of Shares Purchased |
---|---|---|
January 17 | 78 | 200 |
February 10 | 122 | 300 |
March 23 | 131 | 500 |
Total | | 1000 |
Solution
Create two vectors in R: the price vector contains the purchase prices, and the shares vector contains the number of shares. Then execute the R command weighted.mean(price, shares), as follows:
> price <- c(78, 122, 131)
> shares <- c(200, 300, 500)
> weighted.mean(price, shares)
[1] 117.7
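As a quick check, the weighted mean can also be computed directly from its definition, the sum of each price multiplied by its weight divided by the total weight:

> sum(price*shares)/sum(shares)
[1] 117.7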
A list of common R statistical commands appears in Table B2.
R Command | Result |
---|---|
mean() | Calculates the arithmetic mean |
median() | Calculates the median |
min() | Calculates the minimum value |
max() | Calculates the maximum value |
weighted.mean() | Calculates the weighted mean |
sum() | Calculates the sum of values |
summary() | Calculates the mean, median, quartiles, min, and max |
sd() | Calculates the sample standard deviation |
var() | Calculates the sample variance |
IQR() | Calculates the interquartile range |
barplot() | Plots a bar chart of non-numeric data |
boxplot() | Plots a boxplot of numeric data |
hist() | Plots a histogram of numeric data |
plot() | Plots various graphs, including a scatter plot |
freq() | Creates a frequency distribution table |
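Not all of the commands in Table B2 are demonstrated above. As a brief sketch, sum() and IQR() can be applied to the salaries vector created earlier; the values noted in the comments follow from the summary output shown previously.

sum(salaries)   # total of the four salaries: 257000
IQR(salaries)   # third quartile minus first quartile: 79250 - 47500 = 31750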
Basic Visualization and Graphing Using R
R provides many built-in functions for data visualization and graphing and allows the data scientist significant flexibility and customization options for graphs and other data visualizations.
There are many statistical applications in R, and many graphical representations are possible, such as bar graphs, histograms, time series plots, scatter plots, and others.
As a simple example of a bar graph, assume a college instructor wants to create a bar graph to show enrollment in various courses: statistics, history, physics, and chemistry.
Table B3 shows the enrollment data:
College Course | Student Enrollment |
---|---|
Statistics | 375 |
History | 302 |
Physics | 294 |
Chemistry | 193 |
The basic command to create a bar graph in R is barplot().
First, create a dataframe called enrollment to hold the data (a dataframe can be considered a table or matrix to store the dataset).
enrollment <- data.frame(course=c("Statistics", "History", "Physics", "Chemistry"),
enrolled=c(375, 302, 294, 193))
Next, use the barplot function to create the bar graph and add labels for the x-axis, y-axis, and overall title.
barplot(enrollment$enrolled, names.arg=enrollment$course,
main="Student Enrollment at the College", xlab="Course Name",
ylab = "Enrollment")
The resulting bar graph is shown in Figure B2.
The basic command to create a scatter plot in R is the plot command, plot(x, y), where x is a vector containing the x-values of the dataset and y is a vector containing the y-values of the dataset.
The general format of the command is as follows:
>plot(x, y, main="text for title of graph",
xlab="text for x-axis label", ylab="text for y-axis label")
For example, suppose we are interested in creating a scatter plot to examine the correlation between the value of the S&P 500 and the Nike stock price. Assume we have the data shown in Table B4, collected over a one-year time period.
Date | S&P 500 | Nike Stock Price ($) |
---|---|---|
4/1/2020 | 2912.43 | 87.18 |
5/1/2020 | 3044.31 | 98.58 |
6/1/2020 | 3100.29 | 98.05 |
7/1/2020 | 3271.12 | 97.61 |
8/1/2020 | 3500.31 | 111.89 |
9/1/2020 | 3363.00 | 125.54 |
10/1/2020 | 3269.96 | 120.08 |
11/1/2020 | 3621.63 | 134.70 |
12/1/2020 | 3756.07 | 141.47 |
1/1/2021 | 3714.24 | 133.59 |
2/1/2021 | 3811.15 | 134.78 |
3/1/2021 | 3943.34 | 140.45 |
3/12/2021 | 3943.34 | 140.45 |
(source: https://finance.yahoo.com/)
Note that data can be read into R from a text file or Excel file or from the clipboard by using various R commands. Assume the values of the S&P 500 have been loaded into the vector SP500 and the values of Nike stock prices have been loaded into the vector Nike. Then, to generate the scatter plot, we can use the following R command:
>plot(SP500, Nike, main="Scatter Plot of Nike Stock Price vs. S&P 500",
xlab="S&P 500", ylab="Nike Stock Price")
As a result of these commands, R provides the scatter plot shown in Figure B3.
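For completeness, here is one sketch of how the two vectors could be prepared: either by entering the Table B4 values with c() or by reading them from a file with read.csv(). The file name stock_prices.csv and its column names are hypothetical.

# Option 1: enter the values from Table B4 directly
SP500 <- c(2912.43, 3044.31, 3100.29, 3271.12, 3500.31, 3363.00, 3269.96,
           3621.63, 3756.07, 3714.24, 3811.15, 3943.34, 3943.34)
Nike <- c(87.18, 98.58, 98.05, 97.61, 111.89, 125.54, 120.08,
          134.70, 141.47, 133.59, 134.78, 140.45, 140.45)

# Option 2: read the data from a CSV file with columns named SP500 and Nike
# prices <- read.csv("stock_prices.csv")
# SP500 <- prices$SP500
# Nike <- prices$Nike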
R Analysis for Probability Distributions
As mentioned earlier, a quick statistical summary can be generated using the summary() command in R, where the usage is summary(data_vector).
R has extensive built-in libraries for calculating confidence intervals, hypothesis testing, and working with various probability distributions such as binomial, Poisson, normal, chi-square, and other probability distributions.
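As a brief sketch of these capabilities, the built-in t.test() function performs a one-sample t-test and reports a confidence interval for the mean; applied to the salaries vector created earlier, it returns a 95% confidence interval by default.

t.test(salaries)                     # hypothesis test plus a 95% confidence interval for the mean
t.test(salaries, conf.level = 0.90)  # the confidence level can be changed with conf.level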
As discussed in Measures of Center, data scientists are often interested in various probability distributions such as the normal distribution, binomial distribution, and Poisson distribution. R provides built-in functions to analyze many probability distributions.
R uses the pnorm command to find the area under the normal curve to the left of a specified value:
Usage: pnorm(x_value, mean, standard_deviation)
where pnorm returns the probability that a random variable having the given mean and standard deviation is less than x_value.
EXAMPLE B.2
Problem
Birth weights for newborns in the United States are normally distributed with a mean of 3,400 grams and a standard deviation of 500 grams.
- Use R to find the probability that a random newborn infant weighs less than 4,000 grams.
- Use R to find the probability that a random newborn infant weighs more than 3,000 grams.
Solution
For part (a), use the pnorm command as follows:
pnorm(4000, 3400, 500)
which returns the probability result of:
[1] 0.8849303
For part (b), the option lower.tail=FALSE on the pnorm command calculates the area to the right of a given x-value:
pnorm(3000, 3400, 500, lower.tail=FALSE)
which returns the probability result of:
[1] 0.7881446
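Equivalently, because the total area under the normal curve is 1, the area to the right of 3,000 can be found by subtracting the left-tail probability from 1; this sketch returns the same result as using lower.tail=FALSE:

1 - pnorm(3000, 3400, 500)   # area to the right of 3000, which equals 0.7881446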
R also provides built-in functions for the binomial distribution as follows:
Binomial distribution (cumulative probability):
Usage: pbinom(k, n, p)
where n is the number of trials, p is the probability of success on each trial, and k is the number of successes; pbinom returns the cumulative probability of observing k or fewer successes.
EXAMPLE B.3
Problem
A data scientist conducts a survey for a sample of 20 people and asks the survey question: “Did you find the website for ABC corporation easy to navigate?” From past data, the probability that a random person found the website easy to navigate was 65%. Use R to find the probability that at most 13 of the 20 respond that they find the website easy to navigate.
Solution
Use the pbinom command as follows:
pbinom(13, 20, 0.65)
which returns the probability result of:
[1] 0.5833746
R also provides a built-in function for the probability of an exact number of successes in a binomial distribution:
Binomial distribution (exact probability):
Usage: dbinom(k, n, p)
where n is the number of trials, p is the probability of success on each trial, and k is the number of successes; dbinom returns the probability of observing exactly k successes.
EXAMPLE B.4
Problem
A data scientist conducts a survey for a sample of 20 people and asks the survey question: “Did you find the website for ABC corporation easy to navigate?” From past data, the probability that a random person found the website easy to navigate was 65%. Use R to find the probability that exactly 13 of the 20 respond that they find the website easy to navigate.
Solution
Use the dbinom command as follows:
dbinom(13, 20, 0.65)
which returns the probability result of:
[1] 0.1844012
R also provides a built-in function for the Poisson distribution as follows:
Poisson distribution:
Usage: dpois(k, mu)
where mu is the mean of the Poisson distribution and k is the number of occurrences for which the probability is desired; dpois returns the probability of exactly k occurrences.
EXAMPLE B.5
Problem
A traffic engineer investigates a certain intersection that has an average of 3 accidents per month. Use R to find the probability of 5 accidents in a given month.
Solution
Use the dpois command as follows:
dpois(5, 3)
which returns the probability result of:
[1] 0.1008188
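Note that dpois() gives the probability of exactly k occurrences. For the cumulative probability of k or fewer occurrences, R provides the related ppois() command; a brief sketch for this example:

dpois(5, 3)   # probability of exactly 5 accidents in a month
ppois(5, 3)   # probability of at most 5 accidents in a month (approximately 0.916)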
Basic Correlation and Regression Analysis Using R
Recall from Inferential Statistics and Regression Analysis the discussion of correlation and regression analysis. A first step in correlation analysis is to calculate the correlation coefficient (r) for the (x, y) data. R provides a built-in function, cor(), to calculate the correlation coefficient for bivariate data.
As an example, consider the dataset in Table B5 that tracks the return on the S&P 500 versus return on Coca-Cola stock for a seven-month time period.
Month | S&P 500 Monthly Return (%) | Coca-Cola Monthly Return (%) |
---|---|---|
Jan | 8 | 6 |
Feb | 1 | 0 |
Mar | 0 | -2 |
Apr | 2 | 1 |
May | -3 | -1 |
Jun | 7 | 8 |
Jul | 4 | 2 |
To calculate the correlation coefficient for this dataset, first create two vectors in R, one vector for the S&P 500 returns and a second vector for Coca-Cola returns:
> SP500 <- c(8, 1, 0, 2, -3, 7, 4)
> CocaCola <- c(6, 0, -2, 1, -1, 8, 2)
The R command called cor returns the correlation coefficient for the x-data vector and y-data vector:
> cor(SP500, CocaCola)
[1] 0.9123872
Thus the correlation coefficient for this dataset is approximately 0.912.
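Before fitting a linear model, it is good practice to check whether the correlation is statistically significant. One way to sketch this check is with the built-in cor.test() function, which reports a p-value for the null hypothesis that the true correlation is zero (for this dataset, that p-value matches the regression p-value of about 0.004 shown later):

cor.test(SP500, CocaCola)   # test of significance for the correlation coefficient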
Linear Regression Models Using R
To create a linear model in R, assuming the correlation is significant, the command lm() (for linear model) will provide the slope and y-intercept for the linear regression equation.
The format of the R command is
lm(dependent_variable_vector ~ independent_variable_vector)
Notice the use of the tilde symbol (~) as the separator between the dependent variable vector and the independent variable vector.
We use the returns on Coca-Cola stock as the dependent variable and the returns on the S&P 500 as the independent variable, and thus the R command would be
> lm(CocaCola ~ SP500)

Call:
lm(formula = CocaCola ~ SP500)

Coefficients:
(Intercept)        SP500
    -0.3453       0.8641
The R output provides the value of the y-intercept as -0.3453 and the value of the slope as 0.8641. Based on this, the linear model would be

$\hat{y} = -0.3453 + 0.8641x$

where x represents the value of the S&P 500 return and y represents the value of the Coca-Cola stock return.
The results can also be saved as a formula, called model here, using the following R command. To obtain more detailed results for the linear regression, the summary command can be used, as follows:
> model <- lm(CocaCola ~ SP500)
> summary(model)

Call:
lm(formula = CocaCola ~ SP500)

Residuals:
      1       2       3       4       5       6       7
-0.5672 -0.5188 -1.6547 -0.3828  1.9375  2.2969 -1.1109

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.3453     0.7836  -0.441  0.67783
SP500         0.8641     0.1734   4.984  0.00416 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.658 on 5 degrees of freedom
Multiple R-squared: 0.8325,    Adjusted R-squared: 0.7989
F-statistic: 24.84 on 1 and 5 DF,  p-value: 0.004161
In this output, the y-intercept and slope are given, as well as the residuals for each x-value. The output includes additional statistical details regarding the regression analysis.
Predicted values and prediction intervals can also be generated within R. First, we can create a structure in R called a dataframe to hold the values of the independent variable for which we want to generate a prediction. For example, we would like to generate the predicted return for Coca-Cola stock, given that the return for the S&P 500 is 6.
We use the R command called predict().
To generate a prediction for the linear regression equation called model, using the dataframe where the value of the S&P 500 is 6, the R commands will be
> a <- data.frame(SP500=6)
> predict(model, a)
       1
4.839062
The output from the predict command indicates that the predicted return for Coca-Cola stock will be 4.8% when the return for the S&P 500 is 6%.
We can extend this analysis to generate a 95% prediction interval for this result by using the following R command, which adds an option to the predict command to generate a prediction interval:
> predict(model, a, interval="predict")
       fit        lwr     upr
1 4.839062 0.05417466 9.62395
Thus the 95% prediction interval for Coca-Cola return is (0.05%, 9.62%) when the return for the S&P 500 is 6%.
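Similarly, a 95% confidence interval for the mean Coca-Cola return (rather than a prediction interval for an individual month) can be sketched by changing the interval option:

predict(model, a, interval="confidence")   # confidence interval for the mean return when SP500 = 6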
Multiple Regression Models Using R
R also includes many tools to allow the data scientist to conduct multiple regression, where a dependent variable is predicted based on more than one independent variable. For example, we might arrive at a better prediction model for the monthly return of Coca-Cola stock if we consider not only the S&P 500 monthly return but also the monthly sales of Coca-Cola products.
Here are several examples where a multiple regression model might provide an improved prediction model as compared to a regression model with only one independent variable:
- Employee salaries can be predicted based on years of experience and education level.
- Housing prices can be predicted based on square footage of a home, number of bedrooms, and number of bathrooms.
The general form of the multiple regression model is

$y = a + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n$

where:

- $x_1, x_2, \ldots, x_n$ are the independent variables,
- $b_1, b_2, \ldots, b_n$ are the coefficients, where each coefficient is the amount of change in y when the corresponding independent variable is changed by one unit and all other independent variables are held constant,
- a is the y-intercept, which is the value of y when all $x_i = 0$.
Recall from an earlier example that the format for linear regression analysis in R when there is only one independent variable looked like the following:
> model <- lm(y ~ x)
where y is the dependent variable and x is the independent variable.
To “add in” additional independent variables for the multiple regression approach, we use a format as follows:
> model <- lm(y ~ x1 + x2 + x3)
where $x_1$, $x_2$, and $x_3$ are the independent variables.
Example:
Use R to create a multiple regression model to predict the price of a home based on the independent variables square footage and number of bedrooms, using the following dataset. Then use the multiple regression model to predict the price of a home with 3,000 square feet and 3 bedrooms (see Table B6).
Price of Home | Square Footage | Number of Bedrooms |
---|---|---|
466000 | 4668 | 6 |
355000 | 3196 | 5 |
405900 | 3998 | 5 |
415000 | 4022 | 5 |
206000 | 1834 | 2 |
462000 | 4668 | 6 |
290000 | 2650 | 3 |
First, create three vectors in R, one vector each for home price, square footage and number of bedrooms:
> price <- c(466000, 355000, 405900, 415000, 206000, 462000, 290000)
> square_footage <- c(4668, 3196, 3998, 4022, 1834, 4668, 2650)
> bedrooms <- c(6, 5, 5, 5, 2, 6, 3)
Next, run the multiple regression model using the lm command:
> model <- lm(price ~ square_footage + bedrooms)
> summary(model)

Call:
lm(formula = price ~ square_footage + bedrooms)

Residuals:
      1       2       3       4       5       6       7
-1704.6  1438.4  -606.9  6908.7 -6747.4 -5704.6  6416.5

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)    57739.887   9529.693   6.059  0.00375 **
square_footage    66.017      8.924   7.398  0.00178 **
bedrooms       16966.536   6283.183   2.700  0.05408 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6563 on 4 degrees of freedom
Multiple R-squared: 0.9968,    Adjusted R-squared: 0.9953
F-statistic: 630.6 on 2 and 4 DF,  p-value: 9.996e-06
In the R output, note that the column labeled "Estimate" provides the estimates for the coefficients and the y-intercept.
The y-intercept is given as 57740 (rounding to nearest whole number).
The coefficient for the “square footage” variable is given as 66.
The coefficient for the "bedrooms" variable is given as 16967.
Based on these values, the multiple regression model is

$\hat{y} = 57740 + 66 x_1 + 16967 x_2$

where $x_1$ is the square footage and $x_2$ is the number of bedrooms. We can now use the multiple regression model to predict the price of a home with 3,000 square feet and 3 bedrooms by setting $x_1 = 3000$ and $x_2 = 3$, as follows:

$\hat{y} = 57740 + 66(3000) + 16967(3) = 57740 + 198000 + 50901 = 306641$

Thus, the predicted price of a home with 3,000 square feet and 3 bedrooms is $306,641.
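The same prediction can also be sketched with the predict() command introduced earlier. Because predict() uses the unrounded coefficients, its result will differ slightly from the hand calculation above.

new_home <- data.frame(square_footage=3000, bedrooms=3)
predict(model, new_home)   # predicted price for a 3,000-square-foot, 3-bedroom home (close to $306,641)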