LEARNING OUTCOMES:
By the end of this section, you should be able to:
- 10.2.1 Identify meaningful assumptions, state how they are used in the modeling process, and provide justifications for each assumption.
- 10.2.2 Discuss various error measures in the context of a modeling problem.
- 10.2.3 Perform sensitivity analysis on models and interpret the results to determine how different variables and assumptions impact the outcomes of a model.
- 10.2.4 Write informative summaries of strengths and weaknesses of a model.
As discussed in Time Series and Forecasting, modeling is the process of creating a mathematical representation of real-world phenomena to allow for predictions and insights based on data. Modeling involves selecting appropriate algorithms, training the model on historical data, and validating its performance to ensure it generalizes well to new, unseen data. Prior to developing models, it is critical to identify and clearly state assumptions and provide justifications for the data collection and analysis strategies used. In this section we present the methods used for assessing how a model's predictions align with the actual observed data and introduce the concept of model validation. We also briefly discuss approaches used to understand the robustness of different models and share practices to support documentation and feedback loops for a balanced and transparent analysis.
Stating Assumptions and Justifications
A key element in data science report writing is the ability to identify and document significant assumptions that emerge during the foundational stages of a project or task. In general, an assumption is a statement that is thought to be true without being verified or proven. Assumptions often stem from one’s experience, accumulated knowledge, or even intuition. For data scientists, assumptions have a very specific meaning: they are foundational hypotheses or beliefs about the structure, relationships, or distribution of data that guide the analytical approach and model selection for a project. These assumptions play an important role in steering the decision-making process and shaping the solution's development. However, it is important to note that assumptions come with their own risks. If the assumptions turn out to be false or inaccurate, they can lead to bias, errors, delays, or outright failures in the project. Therefore, documenting assumptions in a clear and explicit manner is of utmost importance. Providing justifications for each assumption is equally crucial.
Assumptions vs. Constraints
Assumptions should be documented within the methodology section of the report. When it comes to writing the assumptions section, several best practices should be followed. First, it is important to distinguish between assumptions and constraints. Assumptions are things we believe to be true but have not yet confirmed, while constraints are limitations or restrictions that are imposed on a project or its solution. A common data science–related assumption is the belief that the historical data used to train a predictive model is representative of future conditions, implying that patterns observed in the past will persist in the future. This assumption underpins the use of time series analysis and forecasting models in predicting future trends based on historical data patterns. By contrast, a common constraint encountered is the limited availability of high-quality, relevant data, which can significantly restrict the depth and accuracy of analyses. Additionally, computational limitations, such as processing power and memory capacity, can pose significant challenges in handling large datasets or complex models, impacting the feasibility and scalability of data science projects. Yet other constraints are imposed by the project sponsors, such as requiring solutions that stay within certain resource limits or that must attain certain goals.
Using precise and unambiguous language and avoiding any vagueness or generalities helps in accurately stating the assumptions. Assumptions should be verifiable statements, meaning they should be capable of being evaluated or validated with evidence or data. This approach helps data scientists avoid subjective opinions or personal biases.
Example 10.2
Problem
A manufacturing company commissioned a report on sales and revenue at different production levels. The company has no more than 13 machines that can be used to make the product, and there are 29 employees that are capable of operating the machines. List the constraints of the project along with any implied assumptions. In addition, formulate follow-up questions that you should ask the executives of the company before you begin any data analysis.
Solution
Constraints:
- The company has a maximum of 13 machines, limiting the number of products that can be produced simultaneously.
- There is also a limit on the number of employees, with at most 29 being capable of operating the machines.
Assumptions:
- All 13 machines can be used if there are enough employees to operate them. The 29 employees are sufficient to operate all 13 machines simultaneously. (Note: We do not know how many employees are required to operate a machine, so this needs to be verified.)
- Machines and operators are assumed to have uniform characteristics, meaning that the volume of production depends only on the number of machines and operators available.
- Production will not be interrupted by downtime of either the machines or the employees. The market demand is sufficient to absorb the production levels at various stages, meaning all produced goods can be sold.
(These are only some of the assumptions that could be listed.)
Follow-up questions for the company:
- What is the maximum output capacity of each machine per hour/day?
- How many employees are required to operate each machine?
- Do the machines have uniform performance, or are some capable of producing more of the product? If so, are additional employees required to work the higher-performing machines (e.g., some machines may be larger and require more operators, producing more of the product as a result)?
- What is the average uptime and downtime for each machine?
- How often are machines maintained? Are there any specific times when machines are not operational (e.g., scheduled maintenance, shift changes)?
- Regarding the employees, are there scheduling factors, differentiated expertise levels, or any other factors that must be considered?
- Are there any supply chain or market limitations?
(Many more questions may be formulated beyond this basic sample of inquiries.)
Nonconstant Assumptions
In addition to the above practices, it is essential to include relevant time frames for the validity of the assumptions. This involves specifying when the assumptions are applicable and noting any changes or updates that might impact them. For example, it may be assumed that sales for a new tech product will be very high only in the months leading up to the holiday season.
Providing rationales or context—that is, explaining the basis for each assumption and how it relates to the project's objectives, requirements, or scope—is also crucial. External factors that may influence the assumptions, such as market conditions, customer preferences, or legal regulations, should be considered. These factors could change over time, and so it is important to note that in the modeling process.
Moreover, acknowledging any resource limitations, like budget, staff, equipment, or materials, is vital as these may affect the project's execution or outcome. Technical limitations, including those related to computing, software, hardware, or network, should also be recognized as they can impact the project's functionality or performance and may also change in the future.
For each assumption made, the technical writer should offer a justification supported by evidence, data, or logical reasoning. For example, we might assume that a software product will be compatible with the Microsoft Windows 11 operating system because market research data shows that Windows 11 is the most widely used operating system among the target audience. Another assumption could be that users will have basic computer literacy and familiarity with similar software products based on customer feedback surveys indicating that they are professionals in the field who have used previous versions of the software or related products from competitors. We might also assume that users will have access to online help and support resources since the software product includes a built-in help feature linking to the company's website, where FAQs, tutorials, videos, and contact information are available. Finally, the assumption that the software product will receive regular updates to fix bugs and improve features can be justified by referring to the company's quality assurance policy and customer satisfaction strategy.
To assess whether technical and data science–related assumptions are being met, rigorous validation techniques such as cross-validation and sensitivity analysis are often employed. (This will be discussed further below.) Careful evaluation and validation of these assumptions help to prevent biased results and enhance the generalizability of the model to various datasets and real-world scenarios. Additionally, continuous monitoring and updating of models with new data help in adjusting to changes that were not anticipated by the initial assumptions, ensuring models remain accurate and relevant over time.
Implicit Bias
It is also important to examine ethical considerations related to assumptions and implicit bias in developing data science reports (see Ethics Throughout the Data Science Cycle), as these factors can significantly impact the fairness and accuracy of the findings. Assumptions, if not rigorously validated, can lead to skewed results that misrepresent the data and its implications. Implicit bias, whether in data collection, analysis, or interpretation, can introduce systemic errors that disproportionately affect certain groups, leading to inequitable outcomes. Therefore, it is necessary to examine and challenge assumptions, employ diverse and representative datasets, and implement techniques to detect and mitigate bias. Often, the best approach is to allow a third party to vet the methodology, identifying assumptions that you may have overlooked. Transparency in methodology and a commitment to ethical standards ensure that data science reports are not only scientifically sound but also socially responsible, promoting trust and integrity in data-informed suggestions and decision-making.
By thoroughly documenting these assumptions and their justifications, technical writers can ensure that the assumptions align with the project's scope and requirements. Additionally, this documentation provides a framework for these assumptions to be validated or revised as necessary throughout the project lifecycle.
Interpreting Measures of Fit
As we saw in Time Series and Forecasting, What Is Machine Learning?, and Deep Learning and AI Basics, models need to be evaluated using appropriate measures of fit—that is, the metrics that assess how well a model's predictions align with the actual observed data. Metrics such as R-squared in regression or accuracy in classification are integral in this process. However, it is not sufficient to merely state these values; a comprehensive technical report must explore their meanings and limitations within the context of the specific problem being addressed. High values for measures of fit are not always indicative of a good model. Likewise, low error values do not always mean that the model will do a good job on future data. There are further considerations to explore, such as the bias-variance trade-off, the impact of outliers on the results, and the potential for small changes in input to produce relatively large deviations in output (sensitivity). When reporting on the metrics for model fit, it is vital to put these values into their proper context, including accounting for the model's significance, avoiding overfitting, and ensuring robustness.
Validation Assessment
Including statistical validation results can strengthen the technical report and support the audience's confidence in the findings. However, in certain instances the results of goodness-of-fit testing may not be favorable. To maintain and convey integrity in the work, it is essential to report the results in an honest and accurate manner, even when those results are not favorable. In addition, one may consider bringing on an external evaluator to provide an extra level of validation assessment. The external practitioner may provide additional technical expertise to confirm practices or make recommendations for adjustment.
Recall from Correlation and Linear Regression Analysis that R², or R-squared, also known as the coefficient of determination, measures the proportion of the variance in the dependent variable that is predictable from the independent variables. Consider an example of a linear model in which the factors (feature variables) are square footage of the house, number of bathrooms, proximity to a city center, local school district ranking, etc., while the response (dependent variable) is the market value of the house. After training, the linear regression model has an R² value of 0.75. In simple terms, this value suggests that 75% of the variation in market value can be explained by the features used in the model (e.g., size, number of bathrooms, location) and that 25% of the variation is due to noise and other factors not accounted for in the model. This seems to imply that the model is relatively effective at capturing the relationships between the independent variables and house prices, but it is important to understand what this means precisely. It does not mean that the model predicts house prices with 75% accuracy. Instead, it suggests that a significant portion (75%) of the variance in housing prices is captured by the model.
In the context of housing prices, an R² of 0.75 might be considered strong, given the high variability in real estate markets due to numerous factors, some of which may not be included in the model (e.g., economic trends, unquantifiable factors like scenic views). It is crucial to remember that a high R² value does not imply causation. In other words, we may not conclude that the local school district ranking necessarily causes house prices to go up. Any correlation between housing prices and school district ranking may be due to some other factor, such as median income of the residents in the neighborhood.
Additionally, R² cannot determine whether the coefficient estimates and predictions are unbiased, which is why we must also assess other diagnostic measures and look at residuals to check for patterns that might indicate issues with the model. Perhaps there are interactions or nonlinear effects that should be included in the model, implying that a quadratic model might be more appropriate than a purely linear one, for example. Moreover, additional features might explain some of the remaining variability in the response; in any case, the R² estimate may be affected by the sample size, the presence of outliers, and the extent to which model assumptions are met.
If the analysis had used a different set of features or a different kind of model (e.g., decision tree), the results may have ended up with a different degree of accuracy. Comparing different models for the same set of data can help in appreciating the impact of different features on the prediction accuracy. By understanding that our model explains a significant portion of the variance, we can use it to make informed decisions about housing prices. However, recognizing its limitations and the factors it does not account for is crucial in avoiding overreliance on the model for price prediction. Future improvements could include adding more relevant features, using different modeling techniques, and/or collecting more data to improve the model's accuracy and reliability.
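To make the discussion concrete, here is a minimal sketch of fitting a linear regression to housing-style data and reading off R². The column names, coefficients, and the synthetic data itself are purely illustrative assumptions, not the dataset discussed above.
Python Code
#minimal sketch: fit a linear regression to synthetic housing data and report R-squared
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
df_houses = pd.DataFrame({
    'sqft': rng.normal(1800, 400, n),
    'bathrooms': rng.integers(1, 4, n),
    'miles_to_city': rng.uniform(1, 30, n),
    'school_rank': rng.integers(1, 11, n),
})
#simulated market value: a linear combination of the features plus unexplained noise
noise = rng.normal(0, 40_000, n)
df_houses['market_value'] = (150 * df_houses['sqft'] + 15_000 * df_houses['bathrooms']
                             - 2_000 * df_houses['miles_to_city']
                             + 5_000 * df_houses['school_rank'] + noise)

X_train, X_test, y_train, y_test = train_test_split(
    df_houses.drop('market_value', axis=1), df_houses['market_value'], random_state=42)
model = LinearRegression().fit(X_train, y_train)

#R-squared on held-out data: the share of variance in market value explained
#by the features, not a percentage "accuracy"
print(f"R-squared on test data: {model.score(X_test, y_test):.2f}")
Because the synthetic target is built from the features plus noise, the printed R² reflects how much of the variance the linear model recovers; with real housing data, the unexplained share would also include factors absent from the feature set.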
Example 10.3
Problem
A data science team is examining the diabetes dataset diabetes_data.csv utilizing a classification decision tree (see Decision-Making Using Machine Learning Basics) to better understand how factors such as cholesterol level, sex, height, weight, BMI, and others correlate with the presence of diabetes in patients. Use modeling best practices to evaluate the model and discuss how overfitting will be avoided.
Solution
Python Code
#load libraries for decision trees and visualization
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
#load data. The code for your dataset will depend upon the file path you set up
df = pd.read_csv('diabetes_data.csv', index_col='Patient number')
When preparing data to run various analyses or models, it is recommended to examine the data characteristics to ensure that the data is properly formatted and has the appropriate data types for the analysis or modeling technique. The function df.info()
provides the list of features of the data set (column names) and their types (Dtype).
Python Code
#examine the dataset
df.info()
The resulting output will look like this:
Int64Index: 390 entries, 1 to 390
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   Cholesterol   390 non-null    int64
 1   Glucose       390 non-null    int64
 2   HDL Chol      390 non-null    int64
 3   Age           390 non-null    int64
 4   Gender        390 non-null    object
 5   Height        390 non-null    int64
 6   Weight        390 non-null    int64
 7   BMI           390 non-null    float64
 8   Systolic BP   390 non-null    int64
 9   Diastolic BP  390 non-null    int64
 10  waist         390 non-null    int64
 11  hip           390 non-null    int64
 12  Diabetes      390 non-null    object
dtypes: float64(1), int64(10), object(2)
memory usage: 42.7+ KB
To use the sklearn library, all data must be numeric. However, if you have columns with string data, there are several methods to convert them to numeric values. For “True”/“False” values, simply use .astype('int'). For other strings, you can employ pandas functions such as map, replace, or apply. map is generally the fastest and most efficient of these, but it lacks flexibility, as it converts unmatched values to NaN. replace is more versatile, allowing partial or full data replacement, but it is slower than map. map is preferable for large datasets or in production due to its speed.
Python Code
#convert categorical data to numeric values
df['Gender'] = df['Gender'].replace({'male': 0, 'female': 1})
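For comparison, the same conversion could be written with map instead of replace (a sketch; as noted above, any value not found in the dictionary would become NaN, so the result should be checked for missing values):
Python Code
#alternative conversion using map instead of replace
#any value not in the dictionary (e.g., a typo) would become NaN
df['Gender'] = df['Gender'].map({'male': 0, 'female': 1})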
To prepare the data for modeling, you will set up features and targets. In addition, you will want to establish a set of training and testing data.
Python Code
#set up data features and targets and set up training and testing datasets
features = df.drop('Diabetes', axis=1)
targets = df['Diabetes']
x_train, x_test, y_train, y_test = train_test_split(features, targets, stratify=targets, random_state=42)
The modeling technique used is a decision tree, which operates similarly to logistic regression in sklearn: you create the classifier object and then use the fit method. It has the same score method and other methods such as predict.
Python Code
#run a decision tree
dt = DecisionTreeClassifier()
dt.fit(x_train, y_train)
print(dt.score(x_train, y_train))
print(dt.score(x_test, y_test))
The resulting output will look like:
1.0
0.826530612244898
The accuracy observed on the training set is perfect, registering at 100%, whereas the accuracy on the test set is significantly lower, recorded at approximately 82.7%. This discrepancy is a signal of overfitting. Doing further analysis by examining the depth of the tree and visualizing it would be beneficial.
Python Code
#measure the depth of the tree
dt.get_depth()
The resulting output will look like:
7
Seeing how deep the tree grew and the number of samples in the leaf nodes illustrates that the model is likely overfitting on the data. We can restrict the number of levels (depth) with max_depth. You can try a few values below and see how the scores change. Here, we settled on two since that results in nearly equal train/test scores and appears to reduce the overfitting. Pruning can also be applied to decision trees; as defined in Decision Trees, pruning involves the removal of sections of the tree that provide little predictive power to improve the model's accuracy and prevent overfitting (a brief pruning sketch appears after the tree visualization below).
Python Code
#adjust the depth of the decision tree to 2
dt = DecisionTreeClassifier(max_depth=2)
dt.fit(x_train, y_train)
print(dt.score(x_train, y_train))
print(dt.score(x_test, y_test))
The resulting output will look like:
0.9315068493150684
0.9081632653061225
Python Code
#visualize the decision tree with a max depth of 2
f = plt.figure(figsize=(8, 8))
_ = plot_tree(dt, fontsize=10, feature_names=features.columns.tolist(), filled=True)
The resulting output is a visualization of the decision tree with a maximum depth of 2.
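As an alternative to fixing max_depth, here is a brief sketch of the cost-complexity pruning mentioned above. The ccp_alpha value of 0.01 is an arbitrary illustration; in practice it would be tuned, for example, with cross-validation.
Python Code
#prune the tree via cost-complexity pruning rather than fixing the depth
#ccp_alpha controls how aggressively weak branches are removed; 0.01 is illustrative
dt_pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=42)
dt_pruned.fit(x_train, y_train)
print(dt_pruned.score(x_train, y_train))
print(dt_pruned.score(x_test, y_test))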
Validation Techniques
Suppose that your data science team has come up with multiple models, each of which seems to do well on the training and testing sets. How might you choose between the various models? This is where validation plays a role. Validation is a broad term for the evaluation of multiple predictive models and/or fine-tuning of certain constants that affect the performance of a model (hyperparameter tuning). These techniques provide a means to rigorously test the model's performance on unseen data, thereby reducing overfitting and ensuring that the model captures as much of the true underlying patterns in the data as possible rather than noise or random patterns. Be sure to fully explain all validation methods used and how you selected the best models in the data exploration and analysis part of your report.
Validation typically involves setting aside a portion of the data that is used to test multiple different models after training. This is different from splitting the data into a training set and testing set (as covered in Decision-Making Using Machine Learning Basics), in which the model is trained only once on the training set and then evaluated on the test set. When validation is used, the dataset is split into three parts: training, validation, and testing sets. After a few rounds of training and validation, the best model is chosen, and then finally the testing set is used on the best model to find a measure of fit. Thorough validation processes are crucial for building robust models that perform well in real-world applications, as they allow for model selection and fine-tuning before deployment, ultimately leading to more reliable and effective predictive analytics.
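A three-way split can be sketched with two calls to train_test_split. The 60%/20%/20% proportions and the variable names below are illustrative assumptions, and the code reuses the features and targets from Example 10.3.
Python Code
#illustrative three-way split: 60% training, 20% validation, 20% testing
from sklearn.model_selection import train_test_split

x_temp, x_test, y_temp, y_test = train_test_split(features, targets, test_size=0.2, stratify=targets, random_state=42)
x_train, x_val, y_train, y_val = train_test_split(x_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42)  #0.25 of the remaining 80% is 20% overall
#candidate models are compared on (x_val, y_val); only the final model is scored on (x_test, y_test)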
Cross-validation, a variation on validation, divides the training set into multiple subsets and iteratively trains and evaluates the model on different combinations of these subsets, offering a more comprehensive evaluation by leveraging all available data. Generally, cross-validation will be done on each model under consideration, which may take considerable time and resources to accomplish. Although time-consuming, cross-validation helps to prevent fitting to random effects in the data.
One simple way to do cross-validation is the bootstrap method (discussed in some detail in Machine Learning in Regression Analysis). Bootstrapping is a powerful statistical technique used to estimate the distribution of a sample by repeatedly resampling with replacement. In the context of model evaluation, the bootstrap method involves creating multiple new datasets, or bootstrap samples, by randomly sampling from the original dataset, allowing duplicates. Each bootstrap sample is used to train and evaluate the model, and the performance metrics are averaged over all samples to provide an estimate of model accuracy and variability. This may be done on multiple different models and values of hyperparameters to arrive at the best model. The bootstrap method is highly flexible and can be applied to a wide range of statistical problems, but it can be computationally intensive, especially with large datasets. Nevertheless, it is a valuable tool for obtaining robust estimates of model performance and understanding the variability in predictions.
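A minimal sketch of bootstrap evaluation, reusing the diabetes features and the train/test split from Example 10.3 (the number of bootstrap samples and the depth-2 tree are illustrative choices):
Python Code
#bootstrap evaluation: resample the training rows with replacement,
#refit the model on each bootstrap sample, and score it on the held-out test set
import numpy as np
from sklearn.utils import resample

boot_scores = []
for i in range(100):  #number of bootstrap samples (illustrative)
    x_boot, y_boot = resample(x_train, y_train, replace=True, random_state=i)
    model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(x_boot, y_boot)
    boot_scores.append(model.score(x_test, y_test))

print(f"Mean accuracy: {np.mean(boot_scores):.2f}, standard deviation: {np.std(boot_scores):.2f}")
The spread of the bootstrap scores gives a sense of how much the accuracy estimate varies with the training data, rather than a single point estimate.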
Additionally, there are two common cross-validation strategies called k-fold cross-validation and leave-one-out cross-validation.
k-fold cross-validation works by dividing the dataset into k equally sized subsets, or folds. The model is trained on k − 1 of the folds and tested on the remaining fold, a process that is repeated k times with each fold serving as the test set once. This approach ensures that every data point is used for both training and validation, providing a more comprehensive assessment of the model's generalizability. This method is particularly advantageous when the dataset is limited in size, as it maximizes the use of available data for both training and testing.
Leave-one-out cross-validation (LOOCV) is a special case of k-fold cross-validation where k is set to the number of data points in the dataset, meaning that each fold contains only one observation. In LOOCV, the model is trained on all data points except one, which is used as the validation set, and this process is repeated for each data point in the dataset. This method provides an exhaustive validation mechanism, ensuring that every single data point is used for testing exactly once, thus offering an unbiased evaluation of the model's performance. However, LOOCV can be computationally expensive, especially for large datasets, because it requires training the model as many times as there are data points. Despite its computational intensity, LOOCV is particularly useful for small datasets where retaining as much training data as possible for each iteration is crucial for model accuracy.
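Both strategies can be sketched with sklearn's cross_val_score, again reusing the diabetes features and targets from Example 10.3; the choice of 5 folds and the depth-2 tree are illustrative.
Python Code
#k-fold cross-validation (k = 5) and leave-one-out cross-validation
from sklearn.model_selection import cross_val_score, LeaveOneOut

dt = DecisionTreeClassifier(max_depth=2, random_state=0)

kfold_scores = cross_val_score(dt, features, targets, cv=5)
print(f"5-fold CV mean accuracy: {kfold_scores.mean():.2f}")

#LOOCV refits the model once per data point, so it can be slow on large datasets
loo_scores = cross_val_score(dt, features, targets, cv=LeaveOneOut())
print(f"LOOCV mean accuracy: {loo_scores.mean():.2f}")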
Accuracy and Error
Once you have settled on a validation strategy, how do you evaluate the performance of your models so that you can select the best one? In previous chapters, we have introduced numerous measures of fit, some of which are useful in classification tasks, while others are more suitable for regression tasks, as we’ll see below.
Classification Accuracy
Recall from What Is Machine Learning? that accuracy is calculated as the proportion of correctly predicted instances out of the total instances in a classification task. For instance, consider the question of using a classification model to predict whether an email is spam or not based on its content and metadata. If the model correctly predicts 90 out of 100 emails as spam or not spam, the accuracy is 90%. While 90% appears to be a high level of accuracy, it is necessary to delve deeper into the understanding of this metric. In cases where the data set is imbalanced, such as when 95% of emails are not spam, an accuracy of 90% might not be as impressive. This scenario could mean that the model is merely predicting the majority class (non-spam) most of the time.
To gain a more nuanced understanding of the model's performance, additional measures like precision, recall, and the F1 score are considered. We saw in What Is Machine Learning? that precision measures the proportion of correctly identified positive instances (true positives) among all instances that the model predicts as positive—that is, precision = TP / (TP + FP). On the other hand, recall measures the proportion of true positives among all actual positives in the dataset—that is, recall = TP / (TP + FN). Finally, the F1-score provides a balance between precision and recall (it is their harmonic mean), offering a single metric to assess the model's accuracy in cases where there is a trade-off between the two.
For example, a high precision with a low recall might indicate that the model is too conservative, labeling an email as spam only when it is very sure and therefore letting much spam go undetected.
It is also vital to examine the confusion matrix, which provides a detailed breakdown of the model's predictions, showing the number of true positives, false positives, true negatives, and false negatives (see Classification Using Machine Learning). The confusion matrix helps to explain the nature of errors the model is making. For instance, a high number of false positives (non-spam emails classified as spam) might be more undesirable than false negatives in an email classification scenario, as it could lead to important emails being missed. Therefore, while accuracy gives an initial indication of the model's performance, a comprehensive evaluation requires analyzing additional metrics and understanding their implications in the specific context of the data and the problem at hand. This approach ensures that decisions based on the model's predictions are informed and reliable, considering not just the overall accuracy but also the nature and implications of the errors it makes.
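These metrics are available in sklearn.metrics. The short sketch below uses made-up spam-filter labels purely for illustration (1 = spam, 0 = not spam); it is not drawn from a real email dataset.
Python Code
#precision, recall, F1, and the confusion matrix for a hypothetical spam filter
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]  #actual labels (illustrative)
y_pred = [1, 0, 0, 0, 0, 0, 0, 1, 1, 0]  #model predictions (illustrative)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))  #rows are actual classes, columns are predicted classes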
Regression Error Measures
For regression models, measures of error are used to evaluate model performance, including those discussed in Time Series and Forecasting and Decision-Making Using Machine Learning Basics: mean absolute error (MAE), mean square error (MSE), mean absolute percentage error (MAPE), and root mean square error (RMSE), in addition to statistical measures of fit such as R-squared or the Pearson correlation coefficient (r), which are detailed in Statistical Inference and Confidence Intervals. Adding one more measure to this list, consider the mean percentage error (MPE), which is a metric used to assess the accuracy of predictions in a regression model by calculating the average of the percentage errors between predicted and actual values.
The mean percentage error (MPE) is like the MAPE except that the absolute value of the difference is not used. Instead, MPE will tend to be smaller than MAPE when the predictions are balanced between higher than actual and lower than actual. However, one drawback of MPE is its susceptibility to skewness from large percentage errors, especially when actual values are close to zero, potentially leading to misleading results. Additionally, MPE can yield negative values if predictions consistently overestimate or underestimate the actual values, complicating the interpretation. Despite these limitations, MPE can offer valuable insights when used in conjunction with other error metrics, contributing to a more nuanced understanding of model performance.
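The sketch below computes several of these measures on a handful of hypothetical actual and predicted house prices; MAPE and MPE are computed directly with numpy so that the sign convention for MPE is explicit (here, overestimates contribute negative percentage errors).
Python Code
#illustrative regression error measures on hypothetical values
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

actual = np.array([250_000, 300_000, 180_000, 420_000])     #hypothetical actual prices
predicted = np.array([240_000, 320_000, 200_000, 400_000])  #hypothetical predictions

mae = mean_absolute_error(actual, predicted)
rmse = np.sqrt(mean_squared_error(actual, predicted))
mape = np.mean(np.abs((actual - predicted) / actual))  #mean absolute percentage error
mpe = np.mean((actual - predicted) / actual)           #mean percentage error (signed)

print(f"MAE: {mae:.0f}, RMSE: {rmse:.0f}, MAPE: {mape:.2%}, MPE: {mpe:.2%}")
Because two of the four predictions are too high and two are too low, the signed errors partially cancel, and MPE comes out closer to zero than MAPE, which is exactly the behavior described above.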
AIC and BIC
There are two related measures that are often used in validation, the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC). Most statistical software packages will compute AIC and BIC, but it is helpful to see the formulas. Let n be the number of data points, k the number of parameters (or degrees of freedom), and L̂ the maximized likelihood of the model (see Classification Using Machine Learning). Then AIC = 2k − 2 ln(L̂) and BIC = k ln(n) − 2 ln(L̂).
Both AIC and BIC are criteria used for model selection and validation in the context of statistical models. They help in choosing the best model among a set of candidate models by balancing goodness of fit and model complexity. BIC imposes a harsher penalty on the number of parameters compared to AIC and so will select simpler models overall. During the model selection process, various candidate models are fitted to the data. AIC and BIC values are computed for each model, and the model with the lowest criterion value is selected:
Python Code
# cross validation with Bayesian Information Criterion
from sklearn.model_selection import cross_val_score, GridSearchCV
import numpy as np
# Define the decision tree classifier
dt = DecisionTreeClassifier()
# Perform cross-validation with 5 folds
scores = cross_val_score(dt, features, targets, cv=5)
# Print the average accuracy
print(f"Average accuracy over 5 folds: {scores.mean():.2f}")
# Define the parameter grid for grid search
param_grid = {
    'max_depth': range(1, 10),
    'min_samples_leaf': range(1, 10),
    'min_samples_split': range(2, 10)
}
# Perform grid search with cross-validation
grid_search = GridSearchCV(dt, param_grid, cv=5)
grid_search.fit(features, targets)
# Print the best parameters and score
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.2f}")
# Print a BIC-style score for the best model
best_model = grid_search.best_estimator_
# Rough BIC-style quantity computed from the dataset size and the model's accuracy score;
# the exact BIC formula uses the number of parameters and the maximized log-likelihood
bic = len(features) * np.log(len(targets)) - 2 * best_model.score(features, targets)
print(f"BIC score: {bic:.2f}")
The resulting output will look like this:
Average accuracy over 5 folds: 0.41
Best parameters: {'max_depth': 1, 'min_samples_leaf': 1, 'min_samples_split': 2}
Best score: 0.72
BIC score: 2165.56
The goal of BIC is to balance the fit of the model with its complexity, helping to avoid overfitting. A lower BIC score indicates a better model, as it suggests that the model has a good fit to the data while being relatively simple. For instance, the BIC score of 2,165.56 would be considered good if it is lower than the BIC scores of competing models, indicating it provides a better balance between goodness of fit and complexity.
Sensitivity Analysis
Sensitivity analysis is an analytical technique used to determine how different sources of uncertainty in the inputs of a model or system affect its output. Generally speaking, sensitivity analysis is accomplished by varying inputs or parameters slightly—by up to 5%, for example—and then examining how the output changes. If the output varies too much under small changes of input, then the model may be unstable, requiring different modeling techniques to improve stability.
Particularly in data science, where complex models with numerous variables and sources of uncertainty are common, sensitivity analysis is invaluable. For instance, it quantifies the relationship between dataset size and model performance in a specific prediction problem, aiding in determining the requisite data volume for achieving desired accuracy levels. Additionally, it identifies key input variables that significantly influence the model output, thereby focusing attention on the most pertinent features and parameters for the problem domain.
Sensitivity analysis in data science serves multiple purposes such as identifying influential input variables, assessing the model's robustness against input variations, and uncovering potential biases. Various methods are employed for conducting sensitivity analysis, tailored to the model's complexity and the required detail level. Below are a few examples of commonly used sensitivity analysis techniques.
One-way sensitivity analysis examines how changes in one input parameter at a time affect the outcome of a model. It is useful for identifying the impact of individual variables on the result, allowing decision-makers to understand which single factors are most critical.
Multi-way sensitivity analysis explores the effects of simultaneous changes in multiple input parameters on the outcome. This approach helps in understanding the interactions between different variables and how they collectively influence the model's results.
Scenario analysis evaluates the outcomes under different predefined sets of input parameters, representing possible future states or scenarios. It aids in assessing the resilience of a model's outcomes under various hypothetical conditions, offering insights into potential risks and opportunities.
Monte Carlo simulation uses random sampling and statistical modeling to estimate the probability of different outcomes under uncertainty. This method provides a comprehensive view of the potential variability in results, helping decision-makers gauge the range of possible outcomes and the likelihood of various scenarios.
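As an illustration of the first of these techniques, the following sketch performs a simple one-way sensitivity analysis: it refits a depth-2 diabetes tree on the Example 10.3 training split, perturbs one listed feature at a time by ±5% while holding all others fixed, and records the fraction of test-set predictions that change. The choice of features and the 5% perturbation are illustrative assumptions.
Python Code
#one-way sensitivity analysis: perturb one feature at a time by +/- 5%
#and measure how many of the model's test-set predictions change
import numpy as np

model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(x_train, y_train)
baseline = model.predict(x_test)
for col in ['Cholesterol', 'Glucose', 'BMI']:  #a few illustrative features
    for factor in (0.95, 1.05):
        x_perturbed = x_test.copy()
        x_perturbed[col] = x_perturbed[col] * factor
        changed = np.mean(model.predict(x_perturbed) != baseline)
        print(f"{col} x {factor}: {changed:.1%} of predictions changed")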
Example 10.4
Problem
To assess the sensitivity of the outcomes of the diabetes model discussed above, a Monte Carlo simulation was applied to the data. What is the average accuracy over 1,000 simulations?
Solution
Python Code
import random
# Define the number of simulations
num_simulations = 1000
# Initialize a list to store the results
results = []
# Loop through the number of simulations
for i in range(num_simulations):
    # Randomly split the data into training and testing sets
    x_train, x_test, y_train, y_test = train_test_split(features, targets, stratify=targets, random_state=random.randint(0, 1000))
    # Create a decision tree classifier
    model = DecisionTreeClassifier(random_state=random.randint(0, 1000))
    # Train the model on the training set
    model.fit(x_train, y_train)
    # Predict on the testing set
    y_pred = model.predict(x_test)
    # Calculate the accuracy
    accuracy = (y_pred == y_test).mean()
    # Store the accuracy in the results list
    results.append(accuracy)
# Calculate the average accuracy across all simulations
average_accuracy = sum(results) / len(results)
# Print the average accuracy
print(f"Average accuracy over {num_simulations} simulations: {average_accuracy:.2f}")
The resulting output will look like:
Average accuracy over 1000 simulations: 0.75
The average accuracy across the 1,000 simulations was 75%. Monte Carlo simulations further enhance understanding by presenting a comprehensive view of the model’s response to varied input distributions, highlighting the model's overall uncertainty and potential outliers. Through such analyses, valuable insights emerge about the model under study. For the housing price prediction model discussed earlier, for example, these might include the predominant influence of location on price, sensitivity to data outliers, or diminishing returns from additional bedrooms beyond a certain threshold. This information becomes crucial in refining the model, guiding data collection strategies, and informing more accurate real estate pricing decisions.
Sensitivity analysis supports decision-making processes across various domains to build an understanding of how different inputs influence the outcomes of their models or projects. By carefully changing important factors and watching how results shift, sensitivity analysis makes it easier to see how different inputs are linked to outputs. This process gives a better understanding of which factors matter the most for the end results. Sensitivity analysis is an important component of the data analysis section of your technical report, demonstrating how robust the models are, thereby strengthening the trust that project sponsors have in your chosen modeling method.
Documenting Strengths and Weaknesses
Every model has strengths and weaknesses. When documenting strengths or weaknesses within the results and discussion portion of a data science report, it is crucial to state the results clearly and in a manner that is supported by the data or information that led to the conclusion. The documentation should be in a standalone section of the technical report to draw attention to the matter.
Strengths
Strengths should be specific and meaningful. For example, if the dataset is very comprehensive, you might write: “The dataset includes detailed information on 50,000 housing transactions over the past decade, which provides a robust basis for understanding market trends.” If you want to highlight part of the modeling process that emphasizes accuracy, robustness, etc., you might write: “We implemented a random forest model consisting of 1,000 decision trees. A 10-fold cross-validation approach was employed to validate the model performance, ensuring that the results are not overfitted to the training data.”
Overemphasizing or fabricating strengths should be avoided at all costs. Going back to the example of email spam detection, if the model shows 90% accuracy but the training dataset is very imbalanced, with over 95% of messages being non-spam, then it would be misleading to report the 90% accuracy as a strength of the model.
Weaknesses
No one likes to admit weaknesses; however, in the context of a data science project, listing real or potential weaknesses is an essential part of writing an effective and informative report. Weakness statements should be as specific as possible and should indicate steps taken to address the weaknesses as well as suggest ways in which future work might avoid the weakness. Weaknesses may arise from having limited or biased data, datasets that are missing many values, reliance on models that are prone to either underfitting or overfitting, limited scope, or assumptions and simplifications made that could potentially introduce bias and inaccuracy of the model.
An example of an informative weakness statement is: “Approximately 15% of the records had missing values for key variables such as square footage and number of bedrooms, which were imputed using median values, potentially introducing bias.” This provides detailed information (15% of records have missing values) and actions taken to try to address the problem (median values imputed), and there is an implied suggestion that if a more complete dataset could be found, then this weakness could be mitigated.
Using Peer Feedback
Peer feedback for data science reports can strengthen the deliverable by offering diverse perspectives and expert insights, identifying potential weaknesses, biases, or areas for improvement that the original author might have overlooked. For example, a peer might suggest that additional validation techniques be employed to confirm the robustness of a model's predictions, or individuals could point out a more effective method for visualizing complex data. Incorporating such feedback can lead to more accurate and reliable results, enhancing the overall credibility of the report. In addition, peers can help ensure that the report's language and structure are clear and accessible, making it easier for a broader audience to understand and apply the findings. An example of peer feedback might be “Consider using cross-validation to further validate your model's performance and include a comparison with alternative models to provide a comprehensive analysis.”
For those who are providing feedback, it is best to use constructive criticism, acknowledging limitations while proposing potential solutions or future research directions. Rather than merely outlining weaknesses, emphasize opportunities for improvement, suggesting alternative modeling approaches, data acquisition strategies, or bias mitigation techniques. Providing suggestions for future analyses can lead to advancement in the topic area being analyzed.
Example 10.5
Problem
To better understand consumer behavior and retention, the marketing director of an online clothing retailer requested that the data science team develop a model to analyze customer churn. (Customer churn refers to the rate at which customers stop using a company's products or services over a specific period.) The data science team created a model using logistic regression based on the internal data of the organization. While the model achieved an accuracy of 85%, it exhibited a tendency to misclassify a significant portion of customers who were at high risk of churning based on the results of incoming data. As a supervisor of the data science team, what steps would you take to help the data science team increase accuracy and communicate the current findings to a marketing director?
Solution
Steps to provide constructive feedback within the data science team:
- Create a collaborative space and collective approach to examine the factors impacting the model.
- Acknowledge the limitation by sharing the findings of the most recent data as it compares to the model.
- Focus the conversation on the model enhancement, not the team performance.
- Support the conversation for enhancement through the provision of suggestions (e.g., incorporate additional customer behavior data, external factors, data science techniques) or questions (e.g., data availability, sources).
- Review and agree on the options for next steps moving forward.
Steps to provide constructive feedback to the marketing director:
- Describe the dataset that was used to develop the model and also share why the selected data science technique was used without being overly technical.
- Share the limitations of the dataset and logistic regression technique.
- Present the results of the model with a data visual such as a confusion matrix.
- Invite the marketing director to provide suggestions for additional data to be used.
- Provide suggestions on other data science techniques available based on the recommended data.
- Agree on next steps for model development.