Microsoft® Excel® is a popular spreadsheet program developed by Microsoft Corporation. Excel is one of the first and simplest spreadsheet tools developed to help with the manipulation and analysis of data. It includes a range of functionalities that cater to the diverse needs of many data science applications. You are likely to have used Excel in some form or other—perhaps to organize the possible roommate combinations in your dorm room or to plan for a party or in some instructional context.
Excel was first released by Microsoft in 1987, and it has become the most popular choice for loading and analyzing tabular data.
Basic Data Analysis Using Excel
A screenshot of the main Excel menu when it is open is provided in Figure A1. As you see, Excel consists of a series of submenus such as Home, Insert, Draw, Page Layout, Formulas, etc. Clicking on each Menu item opens a submenu with additional functions. For example, clicking on the Formulas menu item opens a submenu as shown in Figure A2.
Excel presents data in the form of table with rows and columns. We call each rectangle, defined by the combination of row number and column name (e.g., A1, B3), a cell. We can use a mouse pointer to select a specific cell and then edit the data inside. Once a cell is selected, Namebox shows the identifier of the selected cell Figure A3.
Formula Bar is used to read/write a formula for a cell, which is identified by Namebox.
Load Data Using Excel
Loading a dataset to Excel can be done simply by clicking File > Open File. That will open up a file explorer window, and you can simply select the CSV file you want to load. We are going to use a part of a public dataset, titled “SOCR- 1035 Records of Heights (in) and Weights (lbs) of Major League Baseball (MLB) Players” (ch1-SOCR-small.csv).
Once loaded, the dataset will show up on the Worksheet Window, as shown in Figure A4.
Summarize Data Using Excel
We will continue to refer to the same dataset about the MLB players. For simplicity, only 12 items of SD Outfielders and Pitchers are included in Figure A5.
EXAMPLE A.1
Problem
On ch1-SOCR-small.csv, write a formula that counts the number of Outfielders.
Solution
The formula is
=COUNTIF(C2:C13, "Outfielder")”
This will count the rows whose position, located at column C, is “Outfielder.” Note that double quotation marks (i.e., “ ”) are used to indicate “Outfielder” is a string. The result is 7.
Search Data Using Excel
You can search data in a table in multiple ways with Excel. We will first explore ways without using Excel formulas, and then we’ll look at some useful formulas.
You can search for the cells containing a certain number or text by using “Find,” as in Figure A6. The Find pop-up window will have a text box in which you can type in any number or text, and then Excel will search and find all the cells that include number/text you typed in.
If you are more interested in viewing all the rows that fulfill certain criteria, you can use “Filter,” as in Figure A7. Once Filter is enabled, you will see drop-down arrows on each column of the table. The arrows allow you to select specific values that you are searching for and then presents only the rows with such values.
For example, if you click “Outfielder” in the Position window, the resulting table will only have the rows whose Position is Outfielder, as shown in Figure A8.
Now let’s do some more sophisticated searches using some Excel formulas.
INDEX
Index returns a cell value at a specific location, given a range of cells. The location is identified as a relative row and column number within the range. For example, the following formula on ch1-SOCR-small.csv will return 72 since the cell located at the fifth row and second column within the range of C8:E13 will refer to the row 12 and column D—so the cell D12. The value at D12 is 72.
=INDEX(C8:E13,5,2)
EXAMPLE A.2
Problem
On ch1-SOCR-small.csv, what value does the following formula return?
=INDEX(A3:D9,7,3)
Solution
This formula finds the seventh row and third column within the range of A3:D9, which indicates C9. Thus, the resulting value is “Outfielder.”
MATCH
Given a range of single-column cells, MATCH returns the relative row number at which the cell value matches the value you specify. MATCH takes three arguments—the value to match, the range of search, and the match type. This example illustrates how to locate the value 75 within the cell D2, D3, …, through D13. The third argument, match type, is set to 0 so that it seeks for an exact match (i.e., exactly 75, not 74 or 75.5).
=MATCH(75, D2:D13, 0)
The formula above returns 5, a relative row number. The relative row number 5 indicates the fifth row within the range, and this corresponds to the absolute row number 6. Notice that D6 has 75.
What happens if there are multiple matching rows?
If there are multiple rows with a match, MATCH only returns the very first instance. For example, this formula will simply return 2 (i.e., the second row within D2:D13, which is D3) even though there are three occurrences of 72 at rows 3, 5, and 12.
=MATCH(72, D2:D13, 0)
EXAMPLE A.3
Problem
Using MATCH, find the absolute row number of a first player whose weight is 185 lb on ch1-SOCR-small.csv.
Solution
Weight is located in column E. The correct use of MATCH is provided next. It looks through E2, E3, …, E13 and finds the first exact match of 185. The first occurrence of 185 is in the second row in the range E2:E13, which is E3, so this formula will return 2.
=MATCH(185, E2:E13, 0)
To get the absolute row number, you need to offset the output above with a constant number. Notice that the search range starts from E2, so you should offset the output by +1. Therefore, the final answer is:
=MATCH(185, E2:E13, 0) + 1
VLOOKUP
Given the search range, VLOOKUP locates a specific search key within the first column in terms of the row number and then returns a value at the column of your choice along that row. It requires four arguments—the search key, the range of search, the relative position of the column of your interest, and the match type. For example, the formula below locates the first pitcher along column C from C2 to C13 first. Similarly to MATCH, VLOOKUP supports multiple match types and FALSE indicates the exact match (i.e., not the “partial” match) for VLOOKUP.
=VLOOKUP(“Pitcher”, C2:E13, 3, FALSE)
Then, the formula reads a specific cell value in the same row. The column number is indicated as the third argument in the formula, called the relative position. It refers to how many columns the search key is away to the right from the column that is specified in the first argument. The first exact match is found at row 2, and the formula reads values in the third column to the right from column C: E. Thus, the value 220 in cell E2 is the result.
EXAMPLE A.4
Problem
Using VLOOKUP, find the age of Mike Thompson on ch1-SOCR-small.csv.
Solution
First you need to search for the player Mike Thompson. His row can be located by his name, “Mike_Thompson.” Then, the relative position argument of VLOOKUP allows you to read his age. The age column, F, is 6 columns to the left of Name, so the relative position argument should be 6. The fourth argument remains FALSE as you are looking for an exact match.
=VLOOKUP(“Mike_Thompson”, A2:F13, 6, FALSE)
The formula above will return 26.31.
Basic Visualization and Graphing Using Excel
Excel provides many options for generating graphs and other visualizations to assist in data science investigations.
The menu in Figure A9 shows some of the many options available within Excel for graphing.
Graphically Represent Data Using Excel
Graphical representation of data, or data visualization, can be done easily with Excel. Let’s draw a bar chart as an example. You can start by simply clicking Insert > Bar chart > Clustered Column icon (Figure A10).
Once the icon is clicked, you will see an arbitrary bar chart. It could be just a plain whitespace too—it really depends on what cell was selected when you clicked the icon.
Now click the chart area and it will reveal two new tabs—Chart Design and Format. Go to the Chart Design tab and click Select Data (Figure A11).
In the Select Data window, define Chart data range by highlighting A1:F13 (i.e., the entire table). This will auto-fill the rest of the pop-up window (Figure A12).
Three entries—Height, Weight, Age—refer to each group of bars. Each group will have 13 bars, each of which represents different baseball players. For now, let’s plot only Height. You can do so by removing the other two by clicking the “-” button under the entry list.
Click Ok and the bar chart is generated as Figure A13. Notice that columns A through C are displayed as the x-axis labels, as indicated on the pop-up window (see the Horizontal [Category] axis labels part in Figure A12).
If you wish to show only the player names, go back to the Select Data window and update the Horizontal axis labels to be just A2:A13 (Figure A14).
The resulting plot will only have player names along the x-axis (Figure A15).
Changing the formatting of different chart elements is easy in Excel as well. You can 1) click the element to change the text or drag to move the position of the element or 2) double-click the element in the corresponding pop-up menu. Figure A16 shows the pop-up menu for the chart title element. Notice that you can change different aspects of the presentation with the title element, such as Fill and Border.
EXAMPLE A.5
Problem
Using MS Excel, draw a scatterplot between Weight (x-axis) and Height (y-axis). A scatterplot is a plot that depicts each data point as a dot. The location of a dot is typically determined by the x- and y-axis.
Solution
There is an icon for scatterplots under Insert (see Figure A17). On the Select Data pop-up window, select D1:E13 since the weights and heights are in column D and E, as shown in Figure A17 and Figure A18.
The resulting scatterplot is shown in Figure A19. Notice that the numbers along the x-axis are heights, while those along the y-axis are weights.
You can change the chart title and the axis titles by double-clicking the title text and the axes themselves, as in Figure A20.
Once either the x- or y-axis is double-clicked, a different menu will show up and there you can add axis titles, as illustrated in Figure A21.
Basic Statistical Analysis Using Excel
Excel provides many statistical functions and statistical distributions that are used in data science applications.
Excel Analysis for Measures of Center and Dispersion
Excel allows you to type a formula into a cell. Using a formula, you can refer to a value from another cell or produce an output based on a value in a cell (or values from multiple cells). A formula always starts with an equal sign (=). Following is an example formula, calculating an average of three numbers in cells A1, A2, and A3.
= AVERAGE(A1, A2, A3)
Suppose you have numbers in cells A1, A2, and A3 and want to show the average of the three numbers in cell B1. Figure A22 shows how to use AVERAGE in this scenario.
Once you are done typing the formula in B1, press enter. The formula in B1 will change to the result of the formula – 20 (see Figure A23).
Excel does NOT save the formula when the file is saved as CSV
You can use Excel formulas with no problem on a CSV file, but Excel does not save those when the file is saved in the CSV format. If you want to keep the formula you have used, make sure to save your file as XLSX (File > Save As… > Excel Workbook (XLSX)). XLSX stores more information beyond text, such as formulas and drawings.
EXAMPLE A.6
Problem
On ch1-SOCR-small.csv, calculate the average height of the MLB players in the dataset using AVERAGE.
Solution
On the worksheet where the MLB Players dataset is open, type this formula in any empty cell and press the enter key (Figure A24).
=AVERAGE(D2:D13)
This will fill the average across cells D2, D3, D4, …, through D13. The resulting average height is 73.3333.
Other Useful Formulas for Summarizing Data
There are a lot of other ways to use simple Excel formulas to analyze data values. You can type each of the example formulas listed in Table A1 and see how it works.
Formula | Description |
---|---|
=MEDIAN(D2:D13) | Compute median of D2, D3, …, and D13. Returns 72.5 as a result. |
=STDEV(D2:D13) | Compute standard deviation of D2, D3, …, and D13. Returns 3.28 as a result. |
=MIN(D2:D13) | Find the minimum value among D2, D3, …, and D13. Returns 70 as a result. |
=MAX(D2:D13) | Find the maximum value among D2, D3, …, and D13. Returns 82 as a result. |
=SUM(D2:D13) | Sum the values of D2, D3, …, through D13. Returns 880 as a result. |
=COUNTIF(D2:D13, 72) | Count the number of occurrences of “72” within the range of D2, D3, …, through D13. Returns 3 as a result (i.e., there are three 72s in D2:D13). |
EXAMPLE A.7
Problem
On ch1-SOCR-small.csv, write a formula that calculates the median weight of the players.
Solution
The formula is
=MEDIAN(E2:E13)
This will find the median across cell E2, E3, E4, …, through E13, which is 200. If you do not know what median is, don’t worry! You will learn more about it in Measures of Position.
Similar to the built-in function for the mean (=AVERAGE), Excel provides a function to calculate the sample standard deviation of a dataset =STDEV.S (for the sample standard deviation). To calculate these statistical results in Excel, enter the data values in a column. Let’s assume the data values are placed in cells A2 through A11. In any cell, type the Excel command =AVERAGE(A2:A11) and press enter. Excel will calculate the arithmetic mean in this cell. Then, in any other cell, type the Excel command =STDEV.S(A2:A11) and press enter. Excel will calculate the sample standard deviation in this cell. Figure A25 shows the mean and standard deviation for the 10 ages.
Here is a more comprehensive listing of various statistical functions available within Excel. The reader is encouraged to go to the Formulas menu in Excel and select any of the submenus such as “Financial,” “Math & Trig,” Lookup & Reference,” etc. to see various Excel formulas. Under the option for “More Functions,” select STATISTICAL to see a menu of various statistical functions available within Excel. Table A2 shows a sample of some statistical functions available within Excel:
Function | Purpose |
---|---|
=AVERAGE(A1:A10) | Find the mean of a set of numbers. |
=MEDIAN(A1:A10) | Find the median of a set of numbers. |
=STDEV.S(A1:A10) | Find the standard deviation for a set of numbers representing a sample. |
=STDEV.P(A1:A10) | Find the standard deviation for a set of numbers representing a population. |
=VAR.S(A1:A10) | Find the variance for a set of numbers representing a sample. |
=VAR.P(A1:A10) | Find the variance for a set of numbers representing a population. |
=MIN(A1:A10) | Find the minimum of a set of numbers. |
=MAX(A1:A10) | Find the maximum of a set of numbers. |
=MAX(A1:A10) – MIN(A1:A10) | Find the range of a set of numbers. |
=CORREL(A1:A10, B1:B10) | Find the correlation coefficient “r” for (x, y) data. x-data is in cells A1 to A10; y-data is in cells B1 to B10. |
=QUARTILE (A1:A10, 1) | Find the first quartile for a set of numbers. |
=QUARTILE (A1:A10, 2) | Find the second quartile for a set of numbers (note that the second quartile is the same as the median). |
=QUARTILE (A1:A10, 3) | Find the third quartile for a set of numbers. |
Excel Analysis for Probability Distributions
As discussed in Measures of Position, data scientists are often interested in various probability distributions such as the normal distribution, binomial distribution, and Poisson distribution. Excel provides built-in functions to analyze many probability distributions.
Excel uses the command =NORM.DIST to find the area under the normal curve to the left of a specified value:
=NORM.DIST(XVALUE, MEAN, STANDARD_DEV, TRUE)
For example, suppose that at a financial consulting company, the mean employee salary is $60,000 with a standard deviation of $7,500. A normal curve can be drawn to represent this scenario, in which the mean of $60,000 would be plotted on the horizontal axis, corresponding to the peak of the curve. Then, to find the probability that an employee earns more than $75,000, you would calculate the area under the normal curve to the right of the data value $75,000.
For example, at the financial consulting company mentioned previously, the mean employee salary is $60,000 with a standard deviation of $7,500. To find the probability that a random employee’s salary is less than $55,000 using Excel, this is the command you would use:
=NORM.DIST(55000, 60000, 7500, TRUE)
Result: 0.25249
Thus, there is a probability of about 25% that a random employee has a salary less than $55,000.
Excel also provides built-in functions for binomial and Poisson distributions as shown in Table A3:
Probability Distribution | Excel Command | Usage |
---|---|---|
Binomial | =BINOMIAL.DIST | =BINOMIAL.DIST(Number_s, Trials, Probability_S, Cumulative) |
Poisson | =POISSON | =POSSON(X, Mean, Cumulative) |
Excel Analysis for Correlation and Regression
In Inferential Statistics and Regression Analysis, the concepts of correlation and regression were discussed.
In correlation analysis, the data scientist is often interested in calculating the correlation coefficient, which is a numerical measure of the direction and strength of the correlation between two numeric quantities.
The Excel command to calculate the correlation coefficient uses the following format:
=CORREL(A1:A10, B1:B10)
In regression analysis, the data scientist is interested in calculating the equation of the best-fit linear model for two numeric quantities (assuming the correlation is significant).
Recall the equation of the best fit line is
where m is the slope of the line and b is the y-intercept of the line.
To calculate the slope and y-intercept of the linear model using Excel, start by entering the (x, y) data in two columns in Excel. Then the Excel commands =SLOPE and =INTERCEPT can be used to calculate the slope and intercept, respectively.
The dataset in Table A4 will be used as an example: the monthly amount spent on advertising and the monthly revenue for a Fortune 500 company for 12 months.
Month | Amount Spent on Advertising | Revenue |
---|---|---|
Jan | 49 | 12210 |
Feb | 145 | 17590 |
Mar | 57 | 13215 |
Apr | 153 | 19200 |
May | 92 | 14600 |
Jun | 83 | 14100 |
Jul | 117 | 17100 |
Aug | 142 | 18400 |
Sep | 69 | 14100 |
Oct | 106 | 15500 |
Nov | 109 | 16300 |
Dec | 121 | 17020 |
To calculate the slope of the regression model, use the Excel command
=SLOPE(y-data range, x-data range)
It’s important to note that this Excel command expects that the y-data range is entered first and the x-data range is entered second. Since revenue depends on amount spent on advertising, revenue is considered the y-variable and amount spent on advertising is considered the x-variable. Notice the y-data is contained in cells C2 through C13 and the x-data is contained in cells B2 through B13. Thus the Excel command for slope would be entered as
=SLOPE(C2:C13, B2:B13)
In the same way, the Excel command to calculate the y-intercept of the regression model is
=INTERCEPT(y-data range, x-data range)
For the dataset shown in the above table, the Excel command would be
=INTERCEPT(C2:C13, B2:B13)
The results are as follows (see Figure A26):
Based on this, the regression equation can be written as
where x represents the amount spent on advertising (in thousands of dollars) and y represents the amount of revenue (in thousands of dollars).
Exploring Further
Excel Resources
For additional resources, refer to the chapters “Advanced Excel Formulas, Functions, and Techniques” and “Advanced Excel Spreadsheets: Statistical and Data Analysis” in Bolling, T., Mitchell, A., Scott, T., & Wheeler, N. (2023). Workplace Software and Skills. OpenStax. Retrieved from https://openstax.org/books/workplace-software-skills/pages/1-chapter-scenario.