Learning Outcomes
By the end of this section, you should be able to:
- 8.2.1 Define bias and fairness in the context of data science and machine learning.
- 8.2.2 Identify sensitive data elements and implement data anonymization techniques.
- 8.2.3 Apply data validation techniques, such as cross-validation and outlier detection.
In today's technology-driven world, industries produce massive amounts of data every day, and with the growth of digital resources, analyzing and modeling that data has become an essential part of decision-making in many fields, such as business, health care, education, and government. However, given the influence that data analysis and modeling hold, it is crucial to consider the ethical implications of this practice.
Data analysis, as defined in Data Science in Practice, is the process of examining, organizing, and transforming data to gather knowledge and make well-informed decisions. Modeling, on the other hand, as defined in Statistical Inference and Confidence Intervals and applied there and in Introduction to Time Series Analysis, involves employing mathematical and statistical methods to replicate real-life situations and anticipate outcomes. Both processes rely heavily on data, and the accuracy of the results depends on the quality of the data used. Ethical considerations arise when the data contains systematic bias (intentional or unintentional) or when sensitive or private data is used in the analysis and modeling process, thereby compromising individuals' privacy. This section discusses these issues along with some tools that data scientists can use to help mitigate them, including anonymization, data validation, and outlier detection.
Bias and Fairness
Data scientists need to be aware of potential sources of bias and strive for fairness. (See What Is Machine Learning? for a formal definition of bias.) Bias, whether intentional or unintentional, can lead to unethical outcomes and may even raise legal concerns. It is important to proactively address any potential concerns before concluding a project and publishing the results. Data science teams must plan and build models addressing all possible outcomes with fairness and equity at the forefront of their minds to ensure the best possible results.
Biases in data can lead to biased analysis and modeling, ultimately resulting in unreasonable decisions. For example, if a company's dataset only represents a specific demographic, the insights and predictions produced will only apply to that demographic. This could lead not only to exclusion and discrimination toward marginalized groups, but also to inaccurate modeling when given data from those groups. It is important to regularly check for bias in the data and address it appropriately.
When developing a model, it is important to assess how the model might introduce bias or unfairness into its decision-making process so that these effects can be mitigated. Fairness in data science and machine learning is described as the absence of bias in the models and algorithms used to process data. To ensure models are unbiased and fair, developers need to evaluate a variety of characteristics, including data quality and representation, how to break data into training and test groups, how models are trained and validated, and how the forecasted results are monitored and adjusted. In practice, data bias may be due to poor sampling techniques, small datasets, or differences in measurement protocols among different data gatherers, but it could also be due to intentional practices, such as “cooking the books” to make financial data look more favorable for a company, data fabrication or falsification, “cherry-picking” (choosing only data that supports a predefined conclusion and ignoring the rest), and model manipulation to achieve a desired outcome. Motivations behind these practices include financial gain, prestige, and intent to discriminate against other demographic groups.
When data is biased, the resulting models and algorithms may lead to misinformed decisions, inaccurate predictions, or unfair outcomes. A real-world example of how bias in data science methods or data collection can undermine an analysis is the COMPAS algorithm used in the U.S. criminal justice system. COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) is an algorithm designed to predict the likelihood of a defendant reoffending (Berk et al., 2021). However, an investigation in 2016 found that the algorithm was biased against Black defendants. The analysis revealed that the algorithm incorrectly flagged Black defendants as high risk at nearly twice the rate of White defendants. Conversely, White defendants were more likely than Black defendants to be incorrectly flagged as low risk. This bias in the data collection and algorithm led to unjust sentencing and parole decisions, disproportionately affecting Black individuals and undermining the fairness and accuracy of the judicial process. This case underscores the critical need for rigorous evaluation and mitigation of bias in data science applications to ensure fair and equitable outcomes. An interesting application of the open-source Python library FairLens to further analyze this dataset can be found on the OECD policy website, though any such application must be viewed with caution.
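The kind of disparity uncovered in that investigation can be screened for with a simple group-wise error-rate check. The following minimal sketch (not the actual COMPAS analysis) uses pandas with made-up data and hypothetical column names (group, predicted_high_risk, and reoffended) to compute the false positive rate for each group, that is, the share of people who did not reoffend but were still flagged as high risk:

```python
import pandas as pd

# Made-up risk-score results: one row per individual (illustration only).
df = pd.DataFrame({
    "group":               ["A", "A", "A", "A", "B", "B", "B", "B"],
    "predicted_high_risk": [1,   1,   0,   1,   1,   0,   0,   0],
    "reoffended":          [0,   1,   0,   0,   1,   0,   0,   1],
})

# False positive rate per group: flagged as high risk among those who did NOT reoffend.
did_not_reoffend = df[df["reoffended"] == 0]
fpr_by_group = did_not_reoffend.groupby("group")["predicted_high_risk"].mean()
print(fpr_by_group)
```

A large gap in false positive rates between groups is a warning sign that the model treats those groups unequally and should prompt further review.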
The data scientist’s primary goal is to construct models that accurately and fairly depict the population of interest by measuring and mitigating potential bias as much as possible in the datasets used to prepare the models and by building fairness into the algorithms used to analyze the data. This includes objectively testing the algorithm design, using standard protocols for data collection, ensuring proper handling of data, and making sure that any decisions made in the modeling process are based on improving the model, not manipulating it to achieve a desired outcome. The use of available training datasets (see Decision-Making Using Machine Learning Basics) is crucial in detecting discrepancies or inequalities that could lead to unfair results. Models and algorithms should be tested frequently for potential bias to ensure continuous fairness throughout the model’s service life.
Factors That Might Lead to Bias
The following factors might cause bias in the process of conducting data science projects:
- Outliers. An outlier might cause an algorithm to drastically change its outcomes due to its abnormality compared to the rest of the data.
- Unrepresentative samples. If the data employed to train an algorithm lacks significant representations of a particular group, then that algorithm will not be qualified to properly serve that group.
- Imbalance of classes. If one class or group is disproportionately represented in the data, then the algorithm may be biased toward that particular group. (A quick check for this and for unrepresentative samples is sketched after this list.)
- Limited features. If there are not enough features to represent the data accurately or completely, the algorithm will likely have an incomplete understanding of the data, leading to inaccurate outcomes or biased decisions.
- Overfitting. Overfitted models fit their training data so closely that they cannot generalize to data they have not seen before, which can introduce bias.
- Data quality. Poorly formatted data, inaccurately labeled data, or misplaced data can cause an algorithm to base its decisions on irrelevant or invalid information, resulting in a bias.
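Several of these factors can be screened for before any model is trained. The following is a minimal sketch, assuming a pandas DataFrame with hypothetical columns age_group (a demographic attribute) and loan_approved (the outcome label); it checks whether any group or outcome class is badly underrepresented:

```python
import pandas as pd

# Made-up training data with a demographic attribute and an outcome label.
df = pd.DataFrame({
    "age_group":     ["18-29", "18-29", "30-49", "30-49", "30-49", "50+", "18-29", "30-49"],
    "loan_approved": [1, 0, 1, 1, 1, 0, 1, 1],
})

# Share of each demographic group in the sample (checks for unrepresentative samples).
group_shares = df["age_group"].value_counts(normalize=True)
print(group_shares)

# Share of each outcome class (checks for imbalance of classes).
print(df["loan_approved"].value_counts(normalize=True))

# Flag groups that make up less than 20% of the data (threshold chosen for illustration).
print("Possibly underrepresented:", list(group_shares[group_shares < 0.20].index))
```

In practice, the threshold for "underrepresented" depends on the population the model is meant to serve; the check simply surfaces imbalances so they can be addressed before training.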
Factors That Generally Lead to Fairness
The factors that yield fairness throughout a data science project process include the following:
- Representative data. The data used to train and test models should represent the entire population it seeks to address to ensure fairness.
- Feature selection. Careful consideration of the features utilized by a model helps ensure that attributes unrelated to the outcome, such as gender or race, are not used for prediction.
- Interpretability. Understanding how outcomes are related to the features of the data can help determine if bias is present and lead to corrections. This may be difficult if the algorithm is very complex, especially when the model involves neural networks and AI. Explainable AI (XAI)—a set of processes, methodologies, and techniques designed to make artificial intelligence (AI) models, particularly complex ones like deep learning models, more understandable and interpretable to humans—facilitates interpretability in this case. (A short interpretability sketch follows this list.)
- Human oversight. Whenever practical, humans should observe and evaluate how the model is performing. If the dataset is very large, then samples can be pulled and examined for consistency.
- Metrics. Certain measurements or tests can be performed on the data and the model to help ensure that expectations are met and any bias is identified and handled.
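As one concrete way to support interpretability, the sketch below uses scikit-learn's permutation importance on a synthetic, made-up dataset (both the data and the model are placeholders, not part of this section's examples): it measures how much the model's held-out accuracy drops when each feature is shuffled, giving a rough picture of which features the model actually relies on.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1,000 rows, 6 numeric features.
X, y = make_classification(n_samples=1000, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# How much does held-out accuracy drop when each feature is randomly shuffled?
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for i, importance in enumerate(result.importances_mean):
    print(f"feature {i}: mean importance {importance:.3f}")
```

If a sensitive or seemingly irrelevant feature turns out to carry much of the model's predictive weight, that is a signal to revisit feature selection and check for bias.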
Example 8.6
Problem
Consider a loan approval system that uses an algorithm to determine an individual's creditworthiness based on their personal and financial information. The algorithm is trained to analyze various attributes such as income, credit score, employment history, and debt-to-income ratio to make a decision on whether or not to approve a loan for each individual based on their likelihood of paying it back. Three datasets are available for training. Which one would be most appropriate for reducing potential bias in the model?
- Dataset 1, containing information from 5,000 college graduates
- Dataset 2, containing information provided by the first 200 people who passed by the bank during the previous week
- Dataset 3, containing information from 3,500 individuals who were selected from different socioeconomic backgrounds, genders, ages, and demographic groups
Solution
The most appropriate training set for this system is Dataset 3, as it includes a broadly representative range of individuals selected from different socioeconomic backgrounds, genders, ages, and demographic groups. While Dataset 2 may also capture a diverse range of individuals, it is likely too small to be of much use. Dataset 1 is not appropriate, as the system would then be biased to perform well only on college graduates, to the exclusion of other groups of individuals.
Potential Misuse of Data Analysis and Modeling
The potential for misuse of data analysis and modeling is another ethical concern. If data is misapplied or used for something other than its intended project purposes, the predictions produced through data analysis and modeling may be used to manipulate or harm individuals or groups.
Consider a health care startup that develops a predictive model to identify patients at high risk of hospitalization due to chronic illnesses. To create the model, the company collects a wide range of sensitive patient data, including medical history, medication records, and biometric information. The company fails to implement robust data protection measures, leading to a breach in which hackers access the patient data. The sensitive nature of this information makes its exposure highly concerning, potentially leading to identity theft or discrimination against affected individuals. This example highlights the risks associated with using private or sensitive data in analysis and modeling, emphasizing the importance of strong data security practices and compliance with data protection regulations.
As mentioned earlier in this section, it is essential to have ethical guidelines and regulations in place to prevent such misuse. In addition, there is a growing concern about the ethical implications of automated decision-making systems. These systems use data analysis and modeling to make decisions without human intervention. While this can save time and resources, it also raises questions about fairness and accountability. Who is responsible if the system produces biased or discriminatory results? It is crucial to have ethical guidelines and regulations in place, including human oversight at all stages of the project, to ensure these systems are transparent, fair, and accountable.
Moreover, there is a responsibility to continuously monitor and reassess the ethical implications of data analysis and modeling. As technology and data practices evolve, so do ethical concerns. It is important to regularly review and adapt ethical standards to ensure the protection of individuals' rights and promote responsible and ethical data usage.
While data analysis and modeling have the power to provide valuable insights and facilitate decision-making, it is crucial to consider the ethical implications surrounding this practice. Transparency, accountability, fairness, and the protection of individuals' rights should be at the forefront of any data analysis and modeling process.
Data Anonymization
As discussed in Ethics in Data Collection, the process of removing or modifying personally identifiable information is called data anonymization or pseudonymization, respectively. Anonymized data can usually still be utilized for analysis without compromising the individual's identity. Pseudonymization of data involves masking, encoding, or encrypting the values of specific features of the data. Features such as names or Social Security numbers can be exchanged with artificial values. If needed, these values can be replaced later using a secure lookup table or decryption algorithm. A common method of pseudonymization is called hashing, which is the process of transforming data into a fixed-length value or string (called a hash), typically using an algorithm called a hash function. The key property of hashing is that it produces a unique output for each unique input (within practical constraints) while making it infeasible to reverse the transformation and retrieve the original input from the hash.
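As a small illustration of hashing, the sketch below uses Python's standard hashlib library to replace a made-up name and Social Security number with fixed-length SHA-256 hashes. Adding a salt (an extra secret string) makes it harder to guess common inputs by hashing candidate values and comparing; the salt value here is purely illustrative.

```python
import hashlib

def pseudonymize(value: str, salt: str = "example-salt") -> str:
    """Return a fixed-length SHA-256 hash of the salted value."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

# Made-up identifiers, not real personal data.
print(pseudonymize("Luke Smith"))    # same input always yields the same hash
print(pseudonymize("567-89-1234"))   # the SSN is replaced by an irreversible hash
```

Because the same input always maps to the same hash, records can still be linked on the hashed value even though the original text cannot be recovered from it.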
As an example, at a particular university, student records are stored securely on a private server. From time to time, student workers need to pull grade information from the server to perform analysis; however, there is a strict policy in place mandating that students may not have access to personally identifiable information of fellow students. Therefore, the university’s information technology division has devised a coding scheme in which the last name and student ID are combined into a codeword, which may be used to access that student’s grade information from the server. This is an example of pseudonymization, as the PII is encoded rather than removed from the dataset.
Robust data anonymization should follow a policy of k-anonymization, which is the principle of ensuring that each record within a dataset is indistinguishable from at least k – 1 other records with respect to a specified set of identifying attributes or features. This approach helps reduce the risk of re-identifying individuals in a dataset, allowing organizations to protect clients and their information without restricting their ability to use the data to gain insight and make decisions.
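A straightforward way to verify k-anonymity is to group the dataset on its quasi-identifying attributes and confirm that every group contains at least k records. The sketch below is a minimal check using pandas, with made-up data and hypothetical quasi-identifier columns age_range and zip_prefix, for k = 5:

```python
import pandas as pd

# Made-up released dataset with two quasi-identifiers.
df = pd.DataFrame({
    "age_range":  ["20-29", "20-29", "20-29", "30-39", "30-39", "20-29", "20-29", "30-39"],
    "zip_prefix": ["021",   "021",   "021",   "021",   "021",   "021",   "021",   "021"],
})

k = 5
group_sizes = df.groupby(["age_range", "zip_prefix"]).size()
too_small = group_sizes[group_sizes < k]

if too_small.empty:
    print(f"Dataset satisfies {k}-anonymity on these attributes.")
else:
    print("Groups with fewer than k records (need further generalization or suppression):")
    print(too_small)
```

In this toy dataset, the (30-39, 021) group has only three records, so those rows would need to be generalized further (for example, into a wider age range) or suppressed before release.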
Data anonymization policy (including the determination of k in k-anonymization if needed) should be set by the regulatory compliance officer (RCO) or other compliance professionals with specialized knowledge of data privacy regulations. These professionals must be familiar with prevailing data privacy regulations and the data masking process to ensure that all security measures comply with these regulations. It is also important that all data masking processes use secure techniques to ensure that data cannot be accessed or manipulated in an unauthorized manner.
Example 8.7
Problem
Mike is a data privacy officer for his local school district. The district is looking to collect data from parents to gain insight into their opinions on the district's curriculum. However, Mike is concerned about protecting the privacy of the parents and their children. He knows that by including too much identifying information in the survey, it could potentially allow someone to re-identify the respondents. What steps can Mike take to protect the privacy of the respondents while still collecting useful data for analysis?
Solution
Mike decides to implement k-anonymization. To determine the appropriate value for k, Mike looks at the total number of respondents and the information collected in the survey. He knows that the more identifying characteristics he removes, the higher the level of anonymity will be. However, he also wants to make sure that the data is still useful for analysis. After some research, he determines that a minimum of 5 respondents should have the same characteristics in order to achieve a satisfactory level of anonymization.
After reviewing the survey questions, Mike sees that there are 10 potential identifying characteristics, such as the number of children, their school district, and their ages. To calculate the k-anonymization, he takes the total number of respondents and divides it by 5. For example, if there are 100 respondents, the k-anonymization would be 100/5 = 20. This means that for each set of characteristics, there must be at least 20 respondents who share the same values. Mike implements this rule by removing or encrypting some of the identifying characteristics in the survey. For example, instead of asking for the exact age of the children, the survey will only ask for their age range. He also removes the specific names of schools and instead groups them by district. By doing so, he ensures that there are at least 20 respondents who share the same characteristics, thus achieving the k-anonymization.
Data that is under intellectual property rights, such as copyright, can also be anonymized as long as the intellectual property rights holder has given permission to do so. Anonymization of data is useful for maintaining privacy in sensitive data while also preserving the intellectual property rights of the original data. However, certain precautions should be taken to ensure that the data is not used for any unethical or illegal purposes.
The process of data anonymization generally follows a few common steps.
- First, it is important to determine which data features need to be anonymized and which do not.
- Once the data elements have been identified, the next step is to develop a technique to mask the sensitive data and remove the identifying elements.
- Tools such as hashing and encryption can be used to anonymize the data.
- It is important to ensure that all techniques used for data masking are secure and compliant with current privacy regulations.
- The data should then be tested and observed to ensure that masking processes are implemented accurately.
- For added security, data anonymization should also involve k-anonymization.
- Finally, access management must be in place to ensure that only those with the right privileges can access the data.
An example of masked data:
Luke Smith, birthdate 1997, Social Security 567-89-1234, and address 123 Main Street becomes "LS97, XXX-XX-XXXX, XXX Main Street."
Here the name and birthdate are combined and encoded into a four-character code, “LS97,” while the Social Security number and the street address number are fully masked, effectively removing those identifying features from the record.
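The following is a minimal sketch of this masking step in Python, using the same made-up record values; real masking rules would be set by the compliance team, as described above.

```python
import re

def mask_record(name: str, birth_year: int, ssn: str, address: str) -> str:
    # Build the short code from the initials plus the last two digits of the birth year.
    initials = "".join(part[0] for part in name.split())
    code = f"{initials}{birth_year % 100:02d}"           # e.g., "LS97"
    masked_ssn = "XXX-XX-XXXX"                           # suppress the SSN entirely
    masked_address = re.sub(r"^\d+", "XXX", address)     # hide the street number
    return f"{code}, {masked_ssn}, {masked_address}"

# Made-up example record.
print(mask_record("Luke Smith", 1997, "567-89-1234", "123 Main Street"))
# -> LS97, XXX-XX-XXXX, XXX Main Street
```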
Data Validation
Data validation (as introduced in Collecting and Preparing Data) is the process of checking, verifying, and validating the accuracy and reliability of data before it is used in decision-making. When both data validation and ethical rules are applied, the models and analyses that use the data are more reliable and effective.
Data validation provides a set of checks and controls that help ensure that data meets certain standards for accuracy and reliability. It also helps ensure that data is handled properly and follows ethical policies and principles. By establishing processes and procedures for data validation, companies can demonstrate a commitment to ethical data usage, which is an important component of any data governance policy. Effective data validation helps protect data from the influence of bias, prejudice, or preconceptions. As mentioned earlier, unbiased data is essential for ethical modeling: it gives an accurate picture of a real-world situation rather than one shaped by the personal intentions or opinions of certain individuals, and it ensures that decisions are based on solid facts and evidence instead of subjective opinion.
The process of data validation can be divided into three key areas: outlier detection, cross-validation, and ethical guidelines.
Outlier detection concerns the identification of observations that are significantly different from the rest of the data and ensures that the data is free from outliers that can affect the results of the models and analysis. If outliers are detected, they may either be removed or imputed to have more reasonable values.
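One common screening approach is the interquartile range (IQR) rule, which flags values falling more than 1.5 IQRs outside the middle 50% of the data. The following is a minimal sketch with made-up measurements:

```python
import numpy as np

values = np.array([12.1, 11.8, 12.4, 12.0, 11.9, 35.0, 12.2, 11.7])  # made-up measurements

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (values < lower) | (values > upper)
print("Outliers flagged for review:", values[is_outlier])   # flags 35.0

# One possible handling choice: impute flagged values with the median.
cleaned = np.where(is_outlier, np.median(values), values)
print("After imputation:", cleaned)
```

Whether a flagged value should be removed, imputed, or kept is itself an ethical judgment; a genuine but unusual observation is not the same thing as a data entry error.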
Cross-validation is a process used to evaluate a model by repeatedly splitting the data into training and testing sets and comparing the model's performance across different subsets of the data. If the results vary too much across subsets, then the model may have been overfit, leading to biases and inaccurate predictions when used on new data.
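The sketch below illustrates the idea using scikit-learn's cross_val_score on a synthetic, made-up dataset (the data and model are placeholders): the data is split into five folds, the model is trained on four and tested on the fifth, and the spread of the five scores hints at whether performance is stable across subsets.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation accuracy

print("Fold accuracies:", scores.round(3))
print("Mean:", round(scores.mean(), 3), "Std:", round(scores.std(), 3))
# A large spread across folds can indicate overfitting or unstable performance.
```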
Ethical guidelines for data validation cover a broad range of protocols and policies intended to inform when and how tools such as cross-validation and outlier detection and handling should be used.
There are overlaps and interactions between these three key areas of data validation. Ethical rules may influence the choice and implementation of outlier detection methods to ensure fairness and avoid discrimination. Cross-validation techniques can be used to assess the robustness of outlier detection algorithms and evaluate their performance across different subsets of the data, and ethical considerations may guide the interpretation of cross-validation results, especially when assessing the impact of outliers on model performance and fairness.
While ethical rules or guidelines, cross-validation, and outlier detection serve distinct purposes in data validation, they work together to ensure the reliability, fairness, and robustness of data-driven models and analyses. The intersections among ethical rules, cross-validation, and outlier detection are illustrated in Figure 8.4.
Exploring Further
The Environmental Impact of Data Science
This section has focused on the ethics of data and how it is used in modeling and data science. But did you know that data science also carries a substantial environmental impact? As datasets grow larger and modeling techniques, including AI, become more and more complex, vast arrays of computer servers are required to do the analysis and number crunching. The sheer number of computer servers, along with all the technology and devices that support them, greatly raises the carbon footprint of data science. The increased use of energy drives up greenhouse gas emissions, contributing to climate change, and depletes valuable resources, leading to increased inequalities around the world. Global electricity consumption related to information and communication technology (of which a significant share supports data science tasks) is estimated to account for between 1% and 1.5% of all global electricity demand, and greenhouse gas emissions from these sources are expected to exceed 14% of global emissions by 2040 (Nordgren, 2023). The environmental effects of data science must be considered alongside other ethical issues if we hope to maintain a sustainable relationship with the Earth.