Chapter Outline
Inferential statistics plays a key role in data science applications, as its techniques allow researchers to infer or generalize observations from samples to the larger population from which they were selected. If the researcher had access to a full set of population data, then these methods would not be needed. But in most real-world scenarios, population data cannot be obtained or is impractical to obtain, making inferential analysis essential. This chapter will explore the techniques of inferential statistics and their applications in data science.
Confidence intervals and hypothesis testing allow a data scientist to formulate conclusions regarding population parameters based on sample data.
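As a rough illustration of drawing a conclusion about a population parameter from a sample, the sketch below builds an approximate 95% confidence interval for a population mean. The data are synthetic (generated from a seeded random number generator), the variable names are hypothetical, and the normal critical value is used on the assumption that the sample is large enough for that approximation to be reasonable.

```python
import math
import random
import statistics

# Synthetic sample: 200 made-up measurements (an assumption for illustration only).
rng = random.Random(42)
sample = [rng.gauss(170, 8) for _ in range(200)]

n = len(sample)
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)

# Critical value for a two-sided 95% interval from the standard normal distribution.
z = statistics.NormalDist().inv_cdf(0.975)

# Margin of error: critical value times the estimated standard error of the mean.
margin = z * sample_sd / math.sqrt(n)
lower, upper = sample_mean - margin, sample_mean + margin

print(f"sample mean = {sample_mean:.2f}")
print(f"approximate 95% CI = ({lower:.2f}, {upper:.2f})")
```

The interval is centered on the sample mean, and its width shrinks as the sample size grows. For small samples, a critical value from the t-distribution would usually be preferred over the normal value used here.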
One technique, correlation analysis, allows the determination of a statistical relationship between two numeric quantities, often referred to as variables. A variable is a characteristic or attribute that can be measured or observed. A correlation between two variables is said to exist when there is an association between them. Finance professionals often use correlation analysis to predict future trends and mitigate risk in a stock portfolio. For example, if two investments are strongly correlated, an investor might not want to hold both in the same portfolio, since the two investments would tend to move in the same direction as market prices rise or fall. To diversify a portfolio, an investor might instead seek investments that are not strongly correlated with one another.
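The portfolio example can be sketched in code. The function below computes the Pearson correlation coefficient, one common measure of linear association, for pairs of return series. The function name `pearson_r` and all of the return values are hypothetical and invented purely for illustration.

```python
import math

def pearson_r(x, y):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up daily returns (in percent) for three hypothetical investments.
returns_a = [1.0, -0.5, 0.8, -1.2, 0.3]
returns_b = [0.9, -0.4, 0.7, -1.0, 0.2]   # tends to move with investment A
returns_c = [0.3, 0.2, -0.4, 0.1, 0.3]    # little linear relationship with A

r_ab = pearson_r(returns_a, returns_b)
r_ac = pearson_r(returns_a, returns_c)

print(f"correlation of A and B: {r_ab:.3f}")
print(f"correlation of A and C: {r_ac:.3f}")
```

With these made-up numbers, A and B have a coefficient close to 1, while A and C have a coefficient close to 0. Under the reasoning above, an investor seeking diversification might prefer to pair A with C rather than with B.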
Regression analysis takes correlation analysis one step further by modeling the relationship between the two numeric quantities or variables when a correlation exists. In statistics, modeling refers specifically to the process of creating a mathematical representation that describes the relationship between different variables in a dataset. The model is then used to understand, explain, and predict the behavior of the data.
This chapter focuses on linear regression, which is the analysis of the relationship between one dependent variable and one independent variable, where the relationship can be modeled using a linear equation. The foundations of regression analysis have many applications in data science, including machine learning, where a mathematical model is created to describe the relationship between the input and output variables of a dataset. Several such applications are explored further in Decision-Making Using Machine Learning Basics. In Time Series and Forecasting, we will use time series models to analyze and predict data collected at different points in time.
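As a preview of what modeling a relationship with a linear equation looks like in practice, the following minimal sketch fits a line of the form y = intercept + slope * x using the least-squares method, one standard way to fit such a line. The function name `fit_line` and the data (hours studied versus exam score) are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def fit_line(x, y):
    """Fit y = intercept + slope * x by least squares; return (slope, intercept)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data: independent variable (hours studied), dependent variable (exam score).
hours = [1, 2, 3, 4, 5]
scores = [52, 57, 61, 68, 72]

slope, intercept = fit_line(hours, scores)

# Use the fitted model to predict the score for a new value of the independent variable.
predicted = intercept + slope * 6

print(f"score = {intercept:.1f} + {slope:.1f} * hours")
print(f"predicted score for 6 hours: {predicted:.1f}")
```

For this data the fitted slope is 5.1 and the intercept is 46.7, so the model predicts a score of about 77.3 for 6 hours of study. The slope is read as the estimated change in the dependent variable for each one-unit increase in the independent variable.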