Several measures are used to describe the center, or average, of a data set, including the mean, median, and mode. The terms mean and average are often used interchangeably. To calculate the mean for a set of numbers, add the numbers together and then divide the sum by the number of data values. The geometric mean is based on the product of the values rather than their sum: for n values, it is the nth root of their product, so replacing every value with the geometric mean leaves the total product unchanged. To calculate the median for a set of numbers, order the data from smallest to largest and identify the middle data value in the ordered data set; when the number of values is even, the median is the mean of the two middle values. The mode is the value that occurs most often in the data set.
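The calculations described above can be sketched in a few lines. This is a minimal illustration using Python's standard statistics module, which is an assumption made here for convenience; the data values are hypothetical.

```python
import math
import statistics

# Hypothetical data set, chosen only for illustration.
values = [2, 4, 4, 8]

# Mean: sum of the values divided by the number of values.
mean = statistics.mean(values)  # (2 + 4 + 4 + 8) / 4 = 4.5

# Geometric mean: the nth root of the product of the n values.
geometric_mean = statistics.geometric_mean(values)

# Replacing every value with the geometric mean keeps the total product
# the same (up to floating-point rounding).
product = math.prod(values)  # 2 * 4 * 4 * 8 = 256
redistributed_product = geometric_mean ** len(values)

# Median: middle of the ordered data. With an even count it is the mean
# of the two middle values (4 and 4 here).
median = statistics.median(values)

# Mode: the value that occurs most often.
mode = statistics.mode(values)

print(mean, geometric_mean, median, mode)
print(math.isclose(product, redistributed_product))
```

Because the geometric mean is computed in floating point, the comparison of the two products uses a tolerance rather than exact equality.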
The standard deviation and variance are measures of the spread of a data set. The standard deviation is small when the data values are all concentrated close to the mean, exhibiting little variation or spread. The standard deviation is larger when the data values are more spread out from the mean, exhibiting more variation. The formula used to calculate the standard deviation depends on whether the data represents a sample or a population, as the formulas for the sample standard deviation and the population standard deviation are slightly different.
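The difference between the two formulas is the divisor: the population variance divides the sum of squared deviations by the number of values, while the sample variance divides by one less than the number of values. The sketch below shows both, assuming Python's standard statistics module and a hypothetical data set.

```python
import statistics

# Hypothetical data set with mean 5; the squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Treating the data as an entire population: divide by N = 8.
population_variance = statistics.pvariance(data)  # 32 / 8 = 4
population_stdev = statistics.pstdev(data)        # square root of 4

# Treating the data as a sample: divide by n - 1 = 7.
sample_variance = statistics.variance(data)       # 32 / 7
sample_stdev = statistics.stdev(data)             # square root of 32 / 7

print(population_variance, population_stdev)
print(sample_variance, sample_stdev)
```

For the same data, the sample standard deviation is always slightly larger than the population standard deviation because of the smaller divisor.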
Several measures are used to indicate the position of a data value in a data set. One measure of position is the z-score for a particular measurement. The z-score indicates how many standard deviations a particular measurement is above or below the mean. Other measures of position include quartiles and percentiles. Quartiles are special percentiles. The first quartile, Q1, is the same as the 25th percentile, and the third quartile, Q3, is the same as the 75th percentile. The median, M, is called both the second quartile and the 50th percentile. To calculate quartiles and percentiles, the data must be ordered from smallest to largest. Quartiles divide ordered data into quarters. Percentiles divide ordered data into hundredths.
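As a rough illustration of these measures of position, the sketch below computes a z-score, the quartiles, and the percentiles for a hypothetical set of scores. It assumes Python's standard statistics module; the helper z_score is a hypothetical name, and it treats the data as a population. Software packages use several slightly different interpolation rules for quartiles and percentiles, so results for other data sets may differ a little from a hand calculation.

```python
import statistics

# Hypothetical data, already ordered from smallest to largest.
scores = [62, 65, 70, 72, 75, 78, 80, 84, 88, 91, 95]


def z_score(x, data):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)  # assumes the data are a population
    return (x - mu) / sigma


# Quartiles: three cut points that divide the ordered data into quarters.
q1, q2, q3 = statistics.quantiles(scores, n=4)

# Percentiles: 99 cut points that divide the ordered data into hundredths.
percentiles = statistics.quantiles(scores, n=100)
p25, p50, p75 = percentiles[24], percentiles[49], percentiles[74]

print(z_score(95, scores), z_score(62, scores))
print(q1, q2, q3)
print(p25, p50, p75)
```

For this data set the 25th, 50th, and 75th percentiles coincide with Q1, the median, and Q3, matching the relationships stated above.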
A frequency distribution provides a method of organizing and tabulating a data set in a summarized way. Once a frequency distribution is generated, it can be used to create graphs that facilitate interpretation of the data set. The normal distribution has two parameters, or numerical descriptive measures: the mean, μ, and the standard deviation, σ. The exponential distribution is often concerned with the amount of time until some specific event occurs.
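A frequency distribution can be tabulated by counting how often each value occurs. The sketch below is one simple way to do this, assuming Python's standard collections module; the ratings data are hypothetical.

```python
from collections import Counter

# Hypothetical categorical data: credit ratings for eight bonds.
ratings = ["A", "B", "A", "C", "B", "A", "B", "B"]

# Frequency: how many times each category occurs.
frequency = Counter(ratings)

# Relative frequency: each count divided by the total number of observations.
total = len(ratings)
relative_frequency = {
    category: count / total for category, count in frequency.items()
}

# Print the table, one row per category.
for category in sorted(frequency):
    print(category, frequency[category], relative_frequency[category])
```

The resulting counts are the heights that a bar graph of the same data would display.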
A probability distribution is a mathematical function that assigns probabilities to various outcomes. In many financial situations, we are interested in determining the expected value of the return on a particular investment or the expected return on a portfolio of multiple investments. When analyzing data that follow a normal distribution, probabilities can be calculated by finding the area under the graph of the normal curve.
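Both ideas can be illustrated briefly. The expected value is the sum of each outcome multiplied by its probability, and an area under the normal curve can be obtained from the cumulative distribution function. The sketch below assumes Python's standard statistics module (NormalDist requires Python 3.8 or later); the returns, probabilities, mean, and standard deviation are all hypothetical.

```python
from statistics import NormalDist

# Hypothetical investment: possible returns (in percent) and their probabilities.
returns = [-5.0, 5.0, 15.0]
probabilities = [0.2, 0.5, 0.3]

# Expected value: sum of (outcome * probability) over all outcomes.
expected_return = sum(r * p for r, p in zip(returns, probabilities))

# Hypothetical normal distribution with mean 100 and standard deviation 15.
dist = NormalDist(mu=100, sigma=15)

# Area under the curve to the left of 115, that is, P(X <= 115).
p_below_115 = dist.cdf(115)

# Area between 85 and 115, that is, within one standard deviation of the mean.
p_within_one_sd = dist.cdf(115) - dist.cdf(85)

print(expected_return)
print(p_below_115, p_within_one_sd)
```

The area within one standard deviation of the mean comes out at roughly 68 percent, which agrees with the familiar empirical rule for normal distributions.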
Data visualization refers to the use of graphical displays to summarize a data set and help interpret patterns and trends in the data. Univariate data refers to observations recorded for a single characteristic or attribute, such as salaries or blood pressure measurements. When graphing univariate data, we can choose from among several types of graphs. The type of graph to use for a given data set depends on the nature of the data and the purpose of the graph. Examples of graphs for univariate data include line graphs, bar graphs, and histograms. Bivariate data refers to paired data where each value of one variable is paired with a value of a second variable. Examples of graphs for bivariate data include time series graphs and scatter plots.
R is an open-source statistical analysis tool that is widely used in the finance industry. It provides an integrated suite of functions for data analysis, graphing, and statistical programming. Because R is open source, its user community is constantly adding new features, and its use as a data analysis and statistical tool continues to grow. R runs on many different computing platforms and can be downloaded at The R Project for Statistical Computing.