Dr. Shaun V. Ault; Dr. Soohyun Nam Liao; Larry Musolino

Learning Outcomes

By the end of this section, you should be able to:

5.1.1 Define time series data.
5.1.2 Identify examples of time series data in real-world applications.

Data can change over time, and it is important to be able to identify the ways that the data may change in order to make reasonable forecasts of future data. Does the data seem to be steadily rising or falling overall? Are there ups and downs that occur at regular intervals? Maybe there are predictable fluctuations in the data that are tied closely to the seasons of the year or multiyear business cycles.

Time series analysis allows for the examination of data points collected or recorded at specific time intervals, enabling the identification of trends, patterns, and seasonal variations crucial for making informed predictions and decisions across a variety of industries. It is widely used in fields such as business, finance, economics, environmental and weather science, information science, and many other domains in which data is collected over time.

Some examples of the areas in which time series analysis is used to solve real-world problems include the following:

Forecasting and prediction. Making predictions about future (unknown) data based on current and historical data may be the most important use of time series analysis. Businesses use such methods to drive strategy decisions and increase future revenues. Policymakers rely on forecasting to predict demographic changes that might influence elections and the need for public services.
Risk management. Time series analysis can help to quantify risk. For example, insurance companies may forecast trends that affect the number and total dollar amounts of claims homeowners may make in the future, which directly affects insurance premiums.
Anomaly detection. If data seems to deviate wildly from predictions, then this could point to a major change in some feature of the data or uncover a hidden factor that will need to be analyzed more carefully. For example, if a family’s spending patterns increase significantly beyond predicted levels, this could indicate either a sudden increase in household earnings or potential fraud.
Health care and epidemiology. Time series analysis is often used to monitor patients’ vitals, study how cancer progresses in individuals from various populations, track disease outbreaks, and predict future need for health care resources such as vaccines or additional beds in hospitals, among other things.

Time series analysis involves a number of statistical and computational techniques to identify patterns, make predictions, or gain insights into the data. The basic steps in time series analysis closely follow the data science cycle as described in What Are Data and Data Science? and are as follows:

Problem Identification/Definition. What is it that you need to know? Do you want to predict sales of a product over the next few years? Are you interested in studying the patterns of sunspots over time? Is your investment in a stock likely to lead to profit or loss?
Data Gathering. Once you have a question in mind, now you need data to analyze. This step can be very time-consuming if a suitable dataset does not already exist. In the best-case scenario, others have already collected data that you might be able to use. In the worst-case scenario, you would need to arrange for data collection by surveys, experiments, observations, or other means. Note that data collection is covered in some detail in Overview of Data Collection Methods.
Data Cleaning. Real-world data rarely comes in a form that is immediately able to be analyzed. Most often, the data must be cleaned, which includes dealing with missing data in some way and formatting the data appropriately so that a computer can understand it. This topic is covered in Web Scraping and Social Media Data Collection.
Model Selection. The next step is to fit the data to a model. Indeed, you may find that you must create numerous models and variations on those models before you are satisfied with the result.
Model Evaluation. This is perhaps the most important step. Any model that you create will be limited in its usefulness if it is not accurate enough. The model should include some indication of how confident you can be in its predictions. Moreover, when new observations become available, the model’s forecasts should be checked against them and adjusted as necessary to better fit the data.

What Is a Time Series?

Any set of data that consists of numerical measurements of the same variable collected and organized according to regular time intervals may be regarded as time series data. For example, see Table 5.1 on the S&P 500, an aggregate index of the top 500 publicly traded companies on the stock market, for the past several years. Here, the variable is the S&P index value at the last trading day of the year (point-in-time data).

Year	S&P Index at Year-End
2013	1848.36
2014	2058.90
2015	2043.94
2016	2238.83
2017	2673.61
2018	2506.85
2019	3230.78
2020	3756.07
2021	4766.18
2022	3839.50
2023	4769.83

Table 5.1 S&P Index Values from 2013 through 2023 (source: https://www.nasdaq.com/market-activity/index/spx/historical)

While the table is informative, it is not particularly easy to use to find trends or make predictions. A visualization would be better. Figure 5.2 displays the time series data using a time series chart (essentially a line chart). Notice that while the general trend is increasing, there are also dips in the years 2018 and 2022.

A line chart titled S&P Index at the End of Year 2013-2023. The X axis has years from 2013 to 2023 and the Y axis ranges from 1,000 to 6,000. A blue line represents a general upward trend over the past decade. The index starts at around 1,900 in 2013, rises to around 2,200 in 2016, then fluctuates between 2,000 and 4,000 until 2021, when it rises more sharply to about 4,800, drops to below 4,000 in 2022, then rises again to about 4,800 in 2023.

Figure 5.2 Line Chart (or Time Series Chart) of the S&P 500 Index. The chart makes it easier to spot the general upward trend over the past decade. (data source: adapted from S&P 500 [SPX] Historical Data)

The example shown in Figure 5.2 is a simple time series with only a single measure (S&P 500 index value) tracked over time. As long as we keep in mind that this data represents yearly values of the S&P 500 index starting in the year 2013, it is more efficient to consider the values alone as an ordered list.

1848.36, 2058.9, 2043.94, 2238.83, 2673.61, 2506.85, 3230.78, 3756.07, 4766.18, 3839.5, 4769.83

In mathematics, an ordered list of numbers is called a sequence. The individual values of the sequence are called terms. An abstract sequence may be denoted by $(x_{n})$ or ${(x_{n})}_{1 \leq n \leq N}$ . In both notations, $n$ represents the index of each value of the sequence, while the latter notation also specifies the range of index values for the sequence ( $n$ takes on all index values from 1 to $N$ ). That is,

{(x_{n})}_{1 \leq n \leq N} = (x_{1}, x_{2}, x_{3}, x_{4}, \dots, x_{N})

We will use the standard term time series to refer to a sequence of time-labeled data and use the term sequence when discussing the terms of the time series as an ordered list of values.

Not every sequence of data points is a time series, though. The set of current populations of the countries of the world is not a time series because the data are not measured at different times. However, if we focused on a single country and tracked its population year by year, then that would be a time series. What about a list of most popular baby names by year? While there is a time component, the data is not numerical, and so this would not fall under the category of time series. On the other hand, information about how many babies are named “Jennifer” each year would constitute time series data.

Typically, we assume that time series data are taken at equal time intervals and that there are no missing values. The terms of a time series tend to depend on previous terms in some way; otherwise, it may be impossible to make any predictions about future terms.
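Both assumptions can be checked before any modeling begins. The following is a minimal sketch using the year-end values from Table 5.1 and the pandas library. The variable names are illustrative and are not part of any dataset used in this chapter.

```python
import pandas as pd

# Year-end S&P 500 values from Table 5.1, indexed by year
years = list(range(2013, 2024))
values = [1848.36, 2058.90, 2043.94, 2238.83, 2673.61, 2506.85,
          3230.78, 3756.07, 4766.18, 3839.50, 4769.83]
sp500 = pd.Series(values, index=years, name="S&P Index at Year-End")

# Equal time intervals: every gap between consecutive years should be the same
gaps = sp500.index.to_series().diff().dropna()
equally_spaced = gaps.nunique() == 1

# No missing values: count the entries that are NaN
missing_count = int(sp500.isna().sum())

print("Equally spaced:", equally_spaced)
print("Missing values:", missing_count)
```

If either check fails on a real dataset, the gaps or missing entries would need to be handled during data cleaning before the series is analyzed.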

Time Series Models

A time series model is a function, algorithm, or method for finding, approximating, or predicting the values of a given time series. The basic idea behind time series models is that previous values should provide some indication as to how future values behave. In other words, there is some kind of function that takes the previous values of the time series as input and produces the next value as output:

x_{n + 1} = f (x_{n}, x_{n - 1}, x_{n - 2}, \dots, x_{1})

However, in all but the most ideal situations, a function that predicts the next value of the time series with perfect accuracy does not exist. Random variation and other factors not accounted for in the model will produce error, which is defined as the extent to which predictions differ from actual observations. Thus, we should always incorporate an error term (often denoted by the Greek letter $ε$ , called “epsilon”) into the model. Moreover, instead of expecting the model to produce the next term exactly, the model generates predicted values. Often, the predicted values of a time series are denoted by $({\hat{x}}_{n})$ to distinguish them from the actual values $(x_{n})$ . Thus,

{\hat{x}}_{n + 1} = f (x_{n}, x_{n - 1}, x_{n - 2}, \dots, x_{1}, ε)

In Time Series Forecasting Methods and Forecast Evaluation Methods, we will delve into the details of building time series models and accounting for error.

Time Series Forecasting

Typically, the goal of time series analysis is to make predictions, or extrapolations, about future values of the time series, a process known as forecasting. As a general rule, the accuracy of a forecast decreases as predictions are made further into the future. When future predictions become no more accurate than tossing a coin or rolling a die, then forecasting at that point or beyond becomes ineffective. In practice, time series models are updated regularly to accommodate new data.

Depending on the situation and the nature of the data, there are many different ways to forecast future data. The simplest of all methods, which is known as the naïve or flat forecasting method, is to use the most recent value as the best guess for the next value. For example, since the S&P 500 had a value of 3,839.5 at the end of 2022, it is reasonable to assume that the value will be relatively close to 3,839.5 at the end of 2023. Note, this would correspond to the time series model, ${\hat{x}}_{n + 1} = x_{n}$ . The naïve method has only limited use in practice.
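To see how the naïve method behaves on the data in Table 5.1, here is a minimal sketch in Python. It treats the error as the actual value minus the forecast, which is one common sign convention and is an assumption here. The variable names are illustrative.

```python
# Year-end S&P 500 values from Table 5.1
years = list(range(2013, 2024))
values = [1848.36, 2058.90, 2043.94, 2238.83, 2673.61, 2506.85,
          3230.78, 3756.07, 4766.18, 3839.50, 4769.83]

# Naive forecast: the prediction for each year is the previous year's value
naive_forecasts = values[:-1]   # predictions for 2014 through 2023
actuals = values[1:]            # observed values for 2014 through 2023

# Error (assumed convention): actual value minus predicted value
errors = [actual - forecast for actual, forecast in zip(actuals, naive_forecasts)]

for year, forecast, error in zip(years[1:], naive_forecasts, errors):
    print(f"{year}: forecast {forecast:,.2f}, error {error:,.2f}")

# Forecast for the first year beyond the table: repeat the last observed value
next_forecast = values[-1]
```

The printed errors show how far each year's naïve forecast lands from the value actually observed, which illustrates why the method is of limited use when the data changes substantially from one period to the next.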

Instead of using only the last observed value to predict the next one, a better approach might be to take into consideration a number of values, $x_{n}, x_{n - 1}, x_{n - 2}, \dots,$ to find the estimate ${\hat{x}}_{n + 1}$ . One might average the last $T$ values together (for some predefined value of $T$ ). This is known as a simple moving average, which will be defined explicitly in Components of Time Series Analysis. For now, let’s illustrate the idea intuitively. Suppose we used the average of the most recent $T = 3$ terms to estimate the next term. The time series model would be:

{\hat{x}}_{n + 1} = \frac{x_{n} + x_{n - 1} + x_{n - 2}}{3}

Based on the data in Table 5.1, the three most recent values are those for 2021, 2022, and 2023, so the prediction for the S&P index value at the end of 2024 would work out as follows.

\frac{4,769.83 + 3,839.5 + 4,766.18}{3} = 4,458.5
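The same calculation can be sketched in Python. The function name moving_average_forecast is hypothetical and is used only for illustration. It averages the last $T$ values of a list to estimate the term that follows the final entry.

```python
# Year-end S&P 500 values from Table 5.1
values = [1848.36, 2058.90, 2043.94, 2238.83, 2673.61, 2506.85,
          3230.78, 3756.07, 4766.18, 3839.50, 4769.83]

def moving_average_forecast(series, T=3):
    """Estimate the next term as the average of the last T values."""
    if len(series) < T:
        raise ValueError("the series must contain at least T values")
    return sum(series[-T:]) / T

# Average of the three most recent values in the table
next_value_forecast = moving_average_forecast(values, T=3)
print(round(next_value_forecast, 1))  # prints 4458.5
```

Changing the argument T changes how many recent values contribute to the estimate. Setting T=1 reduces the calculation to the naïve method.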

Another simple method of forecasting is to fit the data to a linear regression model and then use the regression model to predict future values. (See Linear Regression for details on linear regression.) Linear regression does a better job at capturing the overall direction of the data compared to using only the last data point. On the other hand, linear regression will not be able to model more subtle structures in the data such as cyclic patterns. Moreover, there is a hidden assumption that the data rises or falls more or less uniformly throughout the period we would like to predict, which is often not a valid assumption to make.

Example 5.1

Problem

Find a linear regression model for the data from Table 5.1 and use it to forecast the value of the S&P at the end of the years 2024 and 2025.

Solution

Using standard statistical software, the linear regression is found to be $\hat{y} = 304.437 t - 611,288$ , where $t$ is the year. Alternatively, in Excel, the regression line can be added to the line graph using the “Trendline” feature, with option “Linear.” A graph of the regression line is shown in Figure 5.3.

A line chart titled S&P Index at the End of Year 2013-2023. The X axis has years from 2013 to 2023 and the Y axis ranges from 0 to 6,000. A blue line represents the actual index values, starting at around 1,800 in 2013, rising to around 2,200 in 2016, then fluctuating between 2,000 and 3,000 until 2019, when it rises sharply to nearly 5,000 in 2021, then falls and rises again to close to 5,000 in 2023. The orange line represents a linear trendline, showing a steady upward trend over the same period.

Figure 5.3 Line Chart of the S&P 500 Index with the Linear Regression (data source: adapted from S&P 500 [SPX] Historical Data)

The forecast for the end of year 2024 is $\hat{y} = 304.437 (2024) - 611,288 = 4,892.5$ .

Similarly, the prediction for 2025 is $\hat{y} = 304.437 (2025) - 611,288 = 5,196.9$ .
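The regression can also be reproduced in Python. The sketch below uses NumPy's polyfit function with degree 1 as one way to obtain a least-squares line. This is an illustrative choice and not necessarily the software used to produce the equation above.

```python
import numpy as np

# Year-end S&P 500 values from Table 5.1
years = np.arange(2013, 2024)
values = np.array([1848.36, 2058.90, 2043.94, 2238.83, 2673.61, 2506.85,
                   3230.78, 3756.07, 4766.18, 3839.50, 4769.83])

# Fit a degree-1 polynomial (a straight line) by least squares
slope, intercept = np.polyfit(years, values, 1)

def predict(year):
    """Evaluate the fitted line at the given year."""
    return slope * year + intercept

print(f"slope = {slope:.3f}, intercept = {intercept:,.0f}")
for year in (2024, 2025):
    print(year, round(predict(year), 1))
```

The forecasts printed here come from unrounded coefficients, so they can differ by a small amount (less than one index point) from values computed by hand with the rounded coefficients.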

In this chapter, we will develop more sophisticated tools that can detect patterns that averages or linear models simply cannot find, including autoregressive models, moving averages, and the autoregressive integrated moving average (ARIMA) model. However, creating a fancy model for time series data is only part of the process. An essential part of the process is testing and evaluating your model. Thus, we will also explore measures of error and uncertainty that can determine how good the model is and how accurate its forecasts might be.

Examples of Time Series Data

Time series data is typically measured or observed at regular time intervals, such as daily, weekly, monthly, or annually. While we do hope to find some structure, pattern, or overall trend in the data, there is no requirement that time series data follows a simple, predictable pattern. Thus, we should always be careful when using any statistical or predictive models to analyze time series data. Regular validation and testing of the chosen models against new data are essential to ensure their reliability and effectiveness, which are concepts that we will explore in later sections.

Time series data could represent short-term, frequent observations, such as minute-by-minute monitoring of stock prices. It makes sense to measure stock prices at very short time intervals because they tend to be so volatile. Data is considered to be volatile or displaying high volatility if the individual data points show significant fluctuations from the mean, typically due to factors that are difficult to analyze or predict. Volatility can be measured using statistics such as variance and standard deviation; however, it is important to realize that even the level of volatility can change over time. Measuring data at more frequent intervals helps to track volatility more closely, but volatile data remains difficult to forecast.
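As a rough illustration of measuring volatility, the sketch below computes the standard deviation of the year-over-year percent changes of the values in Table 5.1, first over the whole period and then within a sliding window. Using percent changes and a five-year window are assumptions made for this example. Analysts working with stock data would typically use much more frequent observations.

```python
import pandas as pd

# Year-end S&P 500 values from Table 5.1, indexed by year
values = [1848.36, 2058.90, 2043.94, 2238.83, 2673.61, 2506.85,
          3230.78, 3756.07, 4766.18, 3839.50, 4769.83]
sp500 = pd.Series(values, index=range(2013, 2024))

# Year-over-year percent change (the first year has no previous value)
changes = sp500.pct_change().dropna()

# Overall volatility: standard deviation of the yearly changes
overall_volatility = changes.std()

# Volatility over time: standard deviation within each 5-year window
rolling_volatility = changes.rolling(window=5).std().dropna()

print(f"Overall volatility: {overall_volatility:.3f}")
print(rolling_volatility.round(3))
```

If the rolling values differ noticeably from window to window, that is a sign that the level of volatility itself is changing over time.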

Other sets of data represent more long-term measurements. For example, meteorologists may collect weather data, including temperature, atmospheric pressure, rainfall, wind speed and direction, in a particular location on a daily basis. Over multiple years, seasonal patterns emerge. Time series analysis may be used to pick up on the seasonal variations and then, after taking those variations into account, may find that the overall trend for temperatures is rising or falling in the long term.

Sales data may be measured weekly or monthly. This kind of data is useful in identifying peak sales periods and slumps when customers are not buying as much. Important business decisions are often driven by analysis of time series data.

Finally, very long-term time series may be useful in the sciences. Monitoring levels of atmospheric carbon over time and comparing these measurements with the levels found by extracting ice cores from Antarctica can demonstrate a statistically significant change to long-term cycles and patterns.

Working with and Visualizing Time Series Data in Excel

You may recall from What Are Data and Data Science? that spreadsheet applications are very well suited for basic analysis and visualization of time series data. Simply create two columns, one for the time periods and the other for the data. Highlight both columns (together with their headings) and insert a line chart using the “Recommended Charts” feature in Excel¹, resulting in a chart such as the one shown in Figure 5.4.

Figure 5.4 Using Excel to Create a Line Chart. Instead of using “Recommended Charts,” you could also choose line chart from the panel of choices located on the same tab. (Used with permission from Microsoft)

Working with and Visualizing Time Series Data in Python

In Python, the pandas library is incredibly useful for reading in CSV files and managing them as data structures called DataFrames. It also includes a simple command for visualizing the time series, plot(). Here is how to create a line chart of the daily values of the S&P 500 Index from 2014 through 2024 from the dataset SP500.csv.

Python Code

      import pandas as pd
      import matplotlib.pyplot as plt
      import matplotlib.ticker as ticker
      
      # Read the CSV file into a Pandas DataFrame
      df = pd.read_csv('SP500.csv')
      
      # Create a plot
      fig, ax = plt.subplots()
      df.plot(ax=ax)
      
      # Use FuncFormatter to add commas to the y-axis
      ax.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: '{:,}'.format(int(x))))
      
      # Display the plot
      plt.show()

The resulting output will look like this:

A screenshot of a line graph Python output. The X axis ranges from 0 to 2,500 and the Y axis ranges from 0 to 5,000. A jagged blue line representing the SP 500 shows a general upward trend.

Notice that the simple plot feature did not label the $x$ or $y$ axes. Moreover, the x-axis does not show actual dates, but instead the number of time steps. For simple visualizations that you do not intend to share with a wider audience, this may be OK, but if you want to create better-looking and more informative graphs, use a library such as matplotlib.pyplot. In the next set of Python code, you will see the same dataset visualized using matplotlib. (Note that there is a line of code that uses the to_datetime() command. Dates and times are notoriously inconsistent from dataset to dataset, and software packages may not interpret them correctly without a little help.)