Dr. Shaun V. Ault; Dr. Soohyun Nam Liao; Larry Musolino

8.1 Ethics in Data Collection

Learning Outcomes

By the end of this section, you should be able to:

8.1.1 Discuss data protection and regulatory compliance considerations in data science.
8.1.2 Explain the importance of privacy and informed consent in data science.
8.1.3 Identify data security and data sharing benefits and risks.

Data collection is an essential practice in many fields, and those performing the data collection are responsible for adhering to the highest ethical standards so that the best interest of every party affected by the project is maintained. These standards include respecting individuals’ privacy, accurately representing data, and disclosing collection intentions. First and foremost, it is important to ensure that data is used appropriately within the boundaries of the collected data’s purpose and that individual autonomy is respected; in other words, individuals should maintain control over the decisions regarding the collection and use of their data.

For example, if someone is using a fitness tracking app, they may want to track their daily steps and heart rate but not their location or sleep patterns. They should have the autonomy to choose which data points they want to collect and which ones they do not. Similarly, individuals should also have the autonomy to choose the method of data collection they are most comfortable with. For some, manually inputting data may be preferable, while for others, using a smart device to collect data automatically may be more convenient. In addition, individuals should have control over these decisions—they should be able to manage their own personal data and make informed choices about what they are comfortable sharing. This promotes transparency and empowers individuals to fully participate in the data collection process.

To ensure ethical data collection, autonomy of personal data, and data security, it is essential to consult and adhere to the industry standards in data science as developed by organizations such as IADSS (Initiative for Analytics and Data Science Standards), DSA (Data Science Association), and ADaSci (Association of Data Scientists). In addition, one must be fully aware of the governmental regulations that apply.

Regulatory Compliance

Before beginning any data science project, it is important to understand the regulations and specific state and federal laws as well as industry and professional guidelines that apply to the type of data gathering required. Regulations protect the privacy and security of personal and confidential data as well as secure its accuracy, integrity, and completeness. Such regulations may vary from country to country or even from state to state within a country. In the United States, the US Privacy Act of 1974 regulates how federal agencies may collect, use, maintain, and disseminate personal information. This law was later supplemented by the E-Government Act of 2002, extending protections to data held digitally rather than physically. A good summary of these laws and other privacy regulations can be found at the GAO’s Protecting Personal Privacy site.

Industry and Global Standards

Data privacy compliance requirements vary depending on the industry and the standards or laws that govern the organization’s operations. Failure to comply with these requirements can result in penalties and legal consequences for the organization, so data scientists in these industries need to be aware of and adhere to these regulations to avoid any potential repercussions.

Data security regulations are based on data protection laws, such as the General Data Protection Regulation (GDPR) in the European Union and the 2018 California Consumer Privacy Act (CCPA) in the United States, with its amendment, the California Privacy Rights Act (CPRA), which took effect in 2023. These regulations require that organizations collect, process, and store confidential data securely and transparently. Figure 8.2 and Figure 8.3 summarize the GDPR and CCPA/CPRA principles. The GDPR Checklist for Data Controllers on the EU website presents the principles in an actionable format for organizations.

A diagram illustrating the General Data Protection Regulation (GDPR) and its key principles. A blue box labeled “GDPR” connects with lines to seven colored bars. The text in the bars, from top to bottom, reads: lawfulness, fairness, and transparency; purpose limitation; data minimization; accuracy; storage limitation; integrity and confidentiality; and accountability.

Figure 8.2 GDPR Principles

A diagram in which a blue box labeled CCPA and a pink box labeled CPRA are connected to each other with a line. The blue box is connected to four more subsidiary blue boxes on the right. From top to bottom, those boxes read: The right to know about the personal information a business collects about them and how it is used and shared; The right to delete personal information collected from them (with some exceptions); The right to opt out of the sale or sharing of their personal information; and The right to non-discrimination for exercising their CCPA rights. The main pink box on the left is connected to two subsidiary pink boxes that read from top to bottom: The right to correct inaccurate personal information that a business has about them; and The right to limit the use and disclosure of sensitive personal information collected about them.

Figure 8.3 CCPA/CPRA Principles

Some countries have laws regarding data sovereignty that require data collected from their citizens to be stored and processed within their borders, making it important to understand where data is being collected and stored. Many countries have also enacted laws and regulations involving data retention, or how long personal data may be stored (e.g., the Sarbanes-Oxley Act [SOX] in the United States or the Data Protection Act [DPA] in the United Kingdom); data breach notification, or what to do when data is stolen by a malicious third party (Gramm-Leach-Bliley Act [GLBA] in the United States or the Personal Information Protection and Electronic Documents Act [PIPEDA] in Canada); and many other issues of data privacy, security, and integrity.
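As a rough illustration of how a retention rule might be checked in practice, the following Python sketch flags records that have been held longer than an allowed period. The categories and retention periods are hypothetical values chosen for the example, not figures taken from any law. Some rules set a maximum retention period while others set a minimum, and this sketch assumes a maximum.

```python
from datetime import date, timedelta

# Hypothetical maximum retention periods (in days), for illustration only.
# Real limits come from the applicable law or organizational policy.
RETENTION_DAYS = {"marketing": 365, "transactions": 7 * 365}

def is_past_retention(category: str, collected_on: date, today: date) -> bool:
    """Return True if a record has been kept longer than its category allows."""
    limit = timedelta(days=RETENTION_DAYS[category])
    return today - collected_on > limit

records = [
    {"id": 1, "category": "marketing", "collected_on": date(2022, 1, 15)},
    {"id": 2, "category": "transactions", "collected_on": date(2022, 1, 15)},
]

today = date(2024, 1, 15)
to_delete = [
    r["id"] for r in records
    if is_past_retention(r["category"], r["collected_on"], today)
]
print(to_delete)  # [1]
```

A check like this would typically run on a schedule, with the flagged records reviewed before they are deleted.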

Some regulations apply only to certain types of data or data related to certain industries. For example, in the United States, the Health Insurance Portability and Accountability Act (HIPAA) requires the safeguarding of sensitive information related to patient health. Financial institutions in the United States must adhere to the regulations set forth by the Financial Industry Regulatory Authority (FINRA) and the Securities and Exchange Commission (SEC), both of which regulate the use of sensitive financial information. The Family Educational Rights and Privacy Act (FERPA) provides protections for student educational records and defines certain rights for parents regarding their children’s records.

Security and Privacy

Adhering to regulatory compliance standards from the beginning stages of the project ensures that all relevant data is gathered and preserved properly, abiding by legal and regulatory requirements. Data security measures consist of all the steps taken to protect digital information from unauthorized access, disclosure, alteration, or destruction, ensuring confidentiality, integrity, and availability. It is the responsibility of the data scientist to ensure that all appropriate data security measures are being taken.

Based on the project's purposes, data may need to be either anonymized or pseudonymized. Anonymization is the act of removing personal identifying information from datasets and other forms of data to make sensitive information usable for analysis without the risk of exposing personal information. This can involve techniques such as removing or encrypting identifying information. Pseudonymization is the act of replacing sensitive information in a dataset with artificial identifiers or codes while still maintaining its usefulness for analysis. These techniques may be required to comply with data protection laws.
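The difference between the two techniques can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the record, the field names, the secret key, and the helper functions anonymize and pseudonymize are all hypothetical. A keyed hash (HMAC) is used here as one possible way to generate artificial identifiers.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would be stored securely
# and kept separate from the dataset.
SECRET_KEY = b"replace-with-a-securely-stored-key"

def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace an identifier with an artificial code derived from a keyed hash."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # shortened for readability

def anonymize(record: dict, identifying_fields: set) -> dict:
    """Return a copy of the record with the identifying fields removed."""
    return {k: v for k, v in record.items() if k not in identifying_fields}

record = {"name": "Jane Doe", "email": "jane@example.com", "age": 34, "steps": 8200}

anonymous = anonymize(record, {"name", "email"})
pseudonymous = {**anonymous, "person_id": pseudonymize(record["email"])}

print(anonymous)     # {'age': 34, 'steps': 8200}
print(pseudonymous)  # same fields plus an artificial person_id
```

Because the same input always yields the same code, pseudonymized records for one person can still be linked across tables without exposing the email address. Note that removing direct identifiers alone may not be enough for true anonymization, since combinations of the remaining fields can sometimes still point to an individual.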

To further comply with regulations, some data components may need to be isolated or removed from datasets; for example, industry-specific information, especially in the medical or financial domain, may need to be kept separate from other datasets so that it is not disclosed. Ultimately, adhering to applicable regulatory provisions throughout the data science cycle ensures that all relevant information and data are collected and stored in a way that meets the project's needs without violating any laws or regulations.

The importance of adhering to regulatory compliance cannot be overstated when it comes to ethical business practices, especially when dealing with sensitive customer data such as personal health information and financial records. Firms are obligated to ensure all client information is secure and treated with the utmost care by following all laws and regulations related to data security and privacy. Company policies for doing so typically include a written statement of data security and privacy requirements, clear definitions for handling customer data, a list of approved and unauthorized uses of data, and detailed procedures for handling, storing, and disposing of customer data. Companies should demonstrate a commitment to ethical practices and ensure customers feel safe providing sensitive information to the business. These practices assist in building trust and strengthening the relationship between the business and its customers. In addition, following all applicable regulations and laws protects businesses from potential legal action by customers, regulators, or other third parties.

Regulatory Compliance Teams

Regulatory compliance is commonly managed by legal and compliance teams or, in some industries, by external consultants. Compliance responsibilities are also often shared across departments such as finance, operations, sales, marketing, and IT. How firms ensure compliance with laws can differ greatly depending on the industry and the size and complexity of the organization. In general, corporations develop procedures and policies designed to ensure that compliance requirements are met. These processes and policies may include frequent training, logging, and a system of checks and balances. They may also include more refined measures, such as periodic reviews of business operations and processes or risk analysis and simulations.

Ultimately, an organization’s regulatory compliance officer is responsible for ensuring that the data science project follows all relevant regulatory requirements. A regulatory compliance officer (RCO) is a trained individual responsible for confirming that a company or organization follows the laws, regulations, and policies that govern its operations to avoid legal and financial risks. The RCO is entrusted with the responsibility of identifying potential compliance issues, devising and implementing effective policies and procedures to address them, and providing comprehensive training and guidance to employees. Additionally, they conduct thorough investigations to assess and monitor compliance levels and promptly report any instances of noncompliance to management. To that end, the RCO will typically use a combination of regulatory aids, internal and external assessments, and business process mapping to assess the project’s compliance with applicable laws and regulations. If any deficiency in data privacy or security is found, the RCO will work with the project team to develop a plan for compliance.

Example 8.1

Problem

A retail company, Gadget Galaxy, is launching a new customer study project to analyze customer purchasing behavior. The company has encountered several mistakes in its data collection process that violate regulatory compliance standards. Which of these action plans should this company follow if it is found to have violated regulatory compliance?

A. Gadget Galaxy can outsource its data collection process to a third-party company specializing in regulatory compliance.
B. The mistake made by Gadget Galaxy during its data collection process is not implementing any policies or procedures to protect the privacy and security of customer data. Therefore, Gadget Galaxy can continue with its project without implementing any policies or procedures as it is a small business and is not required to comply with regulations.
C. The project team has conducted simple training on regulatory compliance for handling customer data. Therefore, the project team can do their own research and create policies and procedures for data privacy and security.
D. Gadget Galaxy must halt the project and seek guidance from a regulatory compliance expert to develop appropriate policies and procedures.

Solution

The correct choice is D. Gadget Galaxy must halt the project and seek guidance from a regulatory compliance expert to develop appropriate policies and procedures. The project team should undergo training on regulatory compliance to ensure team members understand and observe these policies and procedures. An external review or risk analysis should also be conducted to identify any potential compliance gaps and address them accordingly.

Privacy and Informed Consent

Several important concepts are embedded in the regulations governing data collection in data science. These include transparency through informed consent and allowing individuals to review and update any personal information. The regulations also call for secure storage and access protocols and for regularly monitoring and revising any associated policies and procedures as necessary. Based on ethical principles of study and as a form of legal protection, informed consent needs to be obtained from project participants. Informed consent is a process of obtaining permission from a research subject indicating that they understand the scope of collecting data as well as the associated potential risks, benefits, and consequences, including how the data will be used and how it will be stored.

Informed consent consists of disclosure, or providing full and clear information about the activity, study, survey, or data collection process; understanding, or ensuring that the participant is able to comprehend the information; voluntariness, or absence of any coercion, pressure, or undue influence; and finally consent, or agreeing to or approving of the process.

For example, a data scientist, Sarah, is conducting research for a company that collects user data through a mobile app. The company requires informed consent from the app users before collecting any data. Sarah decides to conduct a survey to obtain consent from the users. The research aims to understand user behavior and usage patterns in order to improve the app experience. Informed consent is required to ensure transparency and compliance with regulations. By constructing an easy-to-read survey, Sarah is disclosing information and ensuring understanding. Participants take the survey voluntarily, and the results of the survey will indicate consent or non-consent.
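One way to keep track of the four components of informed consent (disclosure, understanding, voluntariness, and consent) is to record each one explicitly and to use only the data from participants for whom all four hold. The sketch below is a hypothetical illustration: the ConsentRecord class and its field names are assumptions made for this example, not part of any standard.

```python
from dataclasses import dataclass

@dataclass
class ConsentRecord:
    """Hypothetical record of one participant's informed consent."""
    participant_id: str
    disclosed: bool    # full and clear information was provided
    understood: bool   # participant confirmed comprehension
    voluntary: bool    # no coercion, pressure, or undue influence
    consented: bool    # participant agreed to the process

    def is_valid(self) -> bool:
        """Consent counts only when all four components are present."""
        return all([self.disclosed, self.understood, self.voluntary, self.consented])

records = [
    ConsentRecord("p1", True, True, True, True),
    ConsentRecord("p2", True, False, True, True),   # did not confirm understanding
    ConsentRecord("p3", True, True, True, False),   # declined
]

usable = [r.participant_id for r in records if r.is_valid()]
print(usable)  # ['p1']
```

In this sketch, data from participants p2 and p3 would be excluded from the analysis.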

Cookies

After you have searched for a product on the internet, do you notice that you suddenly start receiving advertisements for that product or that a store website now displays that product on its front page? You can thank cookies for that. Cookies are small data files that are deposited on users’ hard disks from websites they visit. They keep track of your browsing and search history, collecting information about your potential interests to tailor advertisements and product placement on websites. Cookies can be either blocked or accepted depending on a company’s privacy policy. The potential risk of cookies is that they can store information about the user, user preferences, and user browsing habits. That said, they can also save time by storing users’ log-in information and browsing preferences, allowing internet pages to load faster than when you first loaded them. Regardless of convenience, it is a good idea to clear cookies from time to time and to restrict cookies on certain sites depending on your own preferences. Does the use of cookies by online websites violate any of the principles of informed consent? What about websites that will not load unless the user agrees to their cookie policy?

Confidentiality refers to the safeguarding of privacy and security of data by controlling access to it. The danger is that personally identifiable information (PII), which is information that directly and unambiguously identifies an individual, may end up in the wrong hands. Anonymous data is data that has been stripped of personally identifiable information (or never contained such information in the first place) so that it cannot be linked back to any individual, even if it were improperly accessed. Confidentiality plays an important role in the informed consent process, as individuals providing informed consent must be assured that their data will be kept private. This means that any information disclosed during the process of obtaining informed consent should not be shared with unauthorized individuals or organizations. Furthermore, the right to confidentiality should be ongoing, and confidential data should be disposed of properly after the study has been concluded.

An example of a violation of the privacy of an individual or business in the data collection stage may occur when data scientists collect data without pursuing informed consent. Other examples of possible privacy violations during the data collection stage include collecting sensitive information such as health data without explicit consent (as per HIPAA, introduced earlier in this section), data from minors without their parents’ approval, and data from an individual or group that may be vulnerable or marginalized without proper notification or protection measures in place. Those who work in higher education need to be aware of FERPA (Family Educational Rights and Privacy Act), which restricts the sharing or obtaining of college students’ academic information without prior consent, even if the requesting party is the student’s parent or guardian.

Example 8.2

Problem

A group of data scientists is contracted by Gadget Galaxy to conduct a study on consumer buying patterns. They begin by collecting private information from participants. The team has implemented an informed consent process to ensure ethical and transparent data collection. This process includes a consent form outlining the study's scope, potential risks and benefits, data handling and storage, and the ability for participants to review and update their personal information. Participants can review the terms at any point during the study, provide ongoing consent, and decide if their data can be transferred to a third party.

Which of the following is NOT a key aspect of the informed consent process described?

A. The consent form outlines the study's scope and potential risks and benefits.
B. Participants can review and update their personal information.
C. Participants' consent is limited to a single point in time.
D. Participants have the option to give or withhold their consent for data transfer to a third party.

Solution

The correct choice is C: participants' consent being limited to a single point in time is not an aspect of the informed consent process described. That process emphasizes that participants’ consent should be ongoing and not limited to a single point in time. All other options are key aspects of the informed consent process.

Data Security

Strong data security policies and protocols are critical for any organization seeking to mitigate the risk of data breaches throughout the development process. In addition, institutions should have a strategy ready for responding to data breaches in an effective and timely manner. It is important for organizations to ensure that any data they collect or use has been obtained from secure and verified sources and that any data they share is shared securely.

Data breaches can have far-reaching consequences, from financial losses to diminished public trust in organizations. As an example, consider Equifax, one of the three major credit reporting agencies in the United States, which experienced a massive data breach in 2017. This breach exposed the personal information of over 145 million Americans (nearly half the total U.S. population) along with over 15 million British citizens and a smaller number of people from other countries, making it one of the largest and most severe data breaches in history. The data that was compromised included names, Social Security numbers, birth dates, addresses, driver’s license numbers, and for a small percentage of individuals, credit card information. The exposed information put millions of people at risk of identity theft and fraud. Criminals could use the stolen data to open new credit accounts, file fraudulent tax returns, and commit other forms of identity fraud. Victims of the breach had to spend significant time and money to protect their identities, such as by freezing their credit reports, monitoring their credit, and dealing with fraudulent charges. This case highlights the importance of data security in our increasingly digital world, where large-scale data breaches can have widespread and severe consequences for individuals and organizations alike.

If collected data is particularly sensitive or confidential, then it may require encryption, which is the process of converting data into code that would be very difficult or impossible for a third party to decipher. Many encryption algorithms, built on sophisticated mathematics, are routinely used in industries such as finance, health care, education, and government; however, the details of encryption fall outside the scope of this text.

Exploring Further

Cryptocurrencies

Bitcoin and other cryptocurrencies use encryption to secure transactions and maintain the integrity of the blockchain, the decentralized ledger that records all transactions. Encryption methods, such as public-private key cryptography, ensure that only the rightful owners can access and transfer their digital assets, creating a secure and verifiable system for storing value in a purely digital form. This cryptographic security underpins the trust and value associated with cryptocurrencies and provides a way to “store value” in the otherwise intangible medium of pure data. For more about bitcoin and other cryptocurrencies, check out Riverlearn’s “How Bitcoin Uses Cryptography” and Kraken’s “Beginner’s Guide to Cryptography”.

Intellectual Property

Another form of data protection that typically does not require encryption is that of legal protection of intellectual property. Intellectual property, broadly defined as original artistic works, trademarks and trade secrets, patents, and other creative output, is often protected from copying, modification, and unauthorized use by specific laws. For example, in the United States, copyright generally protects original creative works for the life of the creator plus 70 years. Data protected by intellectual property rights can be used legitimately if proper permission from the rights holder is obtained and proper attribution is given. More about copyright and attribution of data will be discussed in Ethics in Visualization and Reporting.

Some kinds of data are publicly available. Even when this is the case, data scientists need to be aware of the nature of the data to ensure it is used appropriately. For example, if publicly available data contains a person's medical information or financial information, this would still pose a data security risk. Organizations should always ensure that any data they make publicly available is secure and protected by their data security policies and protocols. If data is accidentally breached during the project development process (thereby inadvertently making it public data), this could have serious and damaging consequences. For example, in 2019, two datasets from Facebook apps were breached and exposed to the public internet, revealing phone numbers, account names, and Facebook IDs of over 500 million users. Now that the data has been released, there is no way to ensure its security going forward.

Security for Quantitative and Qualitative Data

The collection of quantitative data and the collection of qualitative data (both defined in Data and Datasets) involve different ethical and data security considerations. Securing quantitative data typically involves encryption and anonymization among other techniques. It is essential to maintain the integrity and accuracy of quantitative data, as errors or tampering can lead to incorrect conclusions. Qualitative data may also be secured with encryption and anonymization, but there is often a greater emphasis on creating and maintaining trust between the data scientist and respondent, as qualitative data can be context-rich, containing identifiable information, confidential details, and personal data, even when names are removed.
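To illustrate why removing names alone may not be enough for qualitative data, the sketch below redacts email addresses and phone numbers from a free-text survey response using regular expressions. The patterns are simplified assumptions for this example (for instance, the phone pattern matches only one common 10-digit format), and the redact function is hypothetical.

```python
import re

# Simplified, illustrative patterns; real-world identifiers take many more forms.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def redact(text: str) -> str:
    """Replace email addresses and phone numbers with placeholder tags."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

response = "Call me at 555-123-4567 or write to sam.lee@example.org about my visit."
print(redact(response))
# Call me at [PHONE] or write to [EMAIL] about my visit.
```

Even after this kind of redaction, contextual details in a response (a job title, a rare condition, a small town) may still identify the respondent, which is why trust and careful handling remain important.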

Finally, training and awareness of team members who collect, use, and analyze sensitive data are crucial for ensuring data security. Anyone who works with data must be in compliance with legal and ethical standards and help to maintain the trust of individuals whose data is being handled. In a significant number of the cases of data breaches, the team members themselves were not thoroughly vetted or trained. For more on how businesses may be better prepared to handle data and respond to data breaches, see the Federal Trade Commission’s Data Breach Response: A Guide for Business.

Example 8.3

Problem

Consider a data science team working on a revolutionary data science project that would have a huge impact on the health care industry. The project involves gathering and analyzing sensitive medical information from patients, including full names, addresses, phone numbers, and medical histories. The team recognizes the critical importance of maintaining data security to ensure the success and ethical integrity of their work. One day, a request for access to the data is received from a researcher from an agency outside of the team who has heard about their work. What steps should be taken to ensure data security?

Solution

This data contains personally identifiable information (PII). Moreover, the team needs to ensure that the project complies with HIPAA regulations because there are medical records involved. The following measures should be taken at the beginning of the project and throughout the collection stage.

Review access control policies: The first step in granting access to any sensitive data is to review the access control policies in place. This includes understanding who has access to the data, what level of access they have, and how this access is granted. This will help determine if the researcher should be granted permission and what level of access they should have.
Authenticate the request: Before granting access, the team should verify the authenticity of the request. This can be done by confirming the identity of the requester and their authority to access the data. The researcher must provide proof of their affiliation with a legitimate organization and the purpose of their request.
Assess need-to-know: The team should evaluate whether the researcher has a legitimate need to know the data. If the data is not relevant to their research or can be obtained through alternative means, access may not be granted. This step ensures that only authorized parties have access to the sensitive data.
Obtain consent from data owner: As the data pertains to confidential medical information of patients, it is important to obtain consent from the data owner before granting access to a third party. This can be done through a formal process where the data owner signs off on granting access to the researcher.
Use secure methods for data transfer: If access is granted, the team must ensure that the data is transferred using secure methods to protect it from any potential breaches. This can include using encryption and secure file-sharing platforms.
Review for due diligence: Once access is granted, the team should continue to monitor and review the usage of the data to ensure that the researcher is only accessing the data for approved purposes. Any breaches or misuse of the data should be reported and addressed immediately.
Revoke access when necessary: Access to sensitive data should only be granted for a specified period of time and should be revoked when it is no longer needed. If the researcher's authorization or employment changes, their access should be revoked immediately to prevent any potential data breaches.
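Several of the measures above (authenticating the request, assessing need-to-know, obtaining consent, and granting access only for a limited time) can be sketched as a simple decision procedure. The following Python example is a hypothetical illustration: the classes AccessRequest and AccessGrant, the evaluate function, and the 30-day default are assumptions made for this sketch, not a prescribed process.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

@dataclass
class AccessRequest:
    requester: str
    identity_verified: bool  # the request was authenticated
    need_to_know: bool       # the data is necessary for the stated purpose
    owner_consent: bool      # the data owner signed off

@dataclass
class AccessGrant:
    requester: str
    expires_on: date
    revoked: bool = False

    def is_active(self, today: date) -> bool:
        """Access holds only until it expires or is revoked."""
        return not self.revoked and today <= self.expires_on

def evaluate(request: AccessRequest, today: date, days: int = 30) -> Optional[AccessGrant]:
    """Grant time-limited access only if every check passes; otherwise deny."""
    if request.identity_verified and request.need_to_know and request.owner_consent:
        return AccessGrant(request.requester, today + timedelta(days=days))
    return None

today = date(2024, 3, 1)
grant = evaluate(AccessRequest("Dr. Rivera", True, True, True), today)
denied = evaluate(AccessRequest("unknown caller", False, True, True), today)

print(grant.is_active(date(2024, 3, 15)))  # True
print(grant.is_active(date(2024, 4, 15)))  # False (expired)
print(denied)                              # None
```

Setting grant.revoked to True would end access immediately, for example if the researcher's authorization or employment changes.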

In general, data scientists should start by carefully considering the need for data collection within the data science task. They should ensure that the data being collected is necessary and relevant to the project. Next, they should understand the types and quantity of data being collected to determine the level of security measures needed for different types of data.

Example 8.4

Problem

A data scientist is leading a project for a construction company that involves extensive data collection. The team carefully assesses the data requirements and implements measures to encrypt the data both in storage and during transmission to safeguard against potential breaches. As the project progresses, the team becomes aware of the importance of obtaining appropriate authorization from copyright holders for any data protected by intellectual property laws. The team prioritizes ethical responsibility by properly attributing all data used and seeking permission from third parties when necessary. However, despite proactive measures, the team acknowledges the ongoing risk of data breaches during the development process. What should the data scientists do if they encounter a data security breach during the development process?

Solution

A possible data security breach can only be handled if the team already has a response plan in place. Having a response plan also shows that data scientists are prepared to handle any potential issues that may arise, can minimize the damage caused by a breach, and can prevent further breaches from occurring. The steps to be taken should include the following:

Identify and contain the breach: The first step is to identify the type and scope of the breach. This may involve investigating any unauthorized access, stolen or lost devices, or other security incidents. Once identified, the breach must be contained to prevent further exposure of sensitive data.
Assess the damage: Once the breach is contained, the team must assess the extent of the damage caused. This may involve determining the type and amount of data that was compromised as well as any potential impact on affected individuals.
Notify authorities: Depending on the nature of the breach, it may be necessary to notify authorities such as law enforcement or regulatory agencies. This is especially important for breaches involving personally identifiable information (PII) or sensitive data.
Notify affected individuals: The team must also inform any individuals whose data was compromised in the breach. This notification should include details of the breach, the type of data exposed, and any steps the individual can take to protect themselves.
Secure affected accounts: If the breach involves compromised user accounts, those accounts must be secured immediately. This may involve resetting passwords or disabling accounts.
Review and update security protocols: The team should conduct a thorough review of their current security protocols and make any necessary updates or improvements to prevent future breaches.
Communicate with all project participants: It is important to keep customers, partners, and employees informed about the breach and the steps being taken to address it. This will help maintain trust and mitigate any potential damage to the organization's reputation.
Monitor for further breaches: The team should continue to monitor for any further breaches or suspicious activity to prevent any additional data exposure.
Conduct a post-breach review: After the data breach has been resolved, the team should conduct a post-breach review to identify any weaknesses or vulnerabilities in their security protocols and take steps to address them.
Provide support for affected individuals: The team should also provide support for affected individuals, such as offering credit monitoring services or assistance with any financial or identity theft issues that may arise from the breach.

Data Sharing

Data sharing is the process of allowing access to or transferring data from one entity (individual, organization, or system) to another. It is done for a variety of reasons, including collaboration, research, providing services, improved forecasting, and jumpstarting innovation. Data sharing can be carried out in a variety of ways and may involve different types of data. Organizations and individuals may use public datasets, open data standards, or data pools to share data. Most importantly, they must establish data governance protocols: a set of rules, policies, and procedures that enable precise control over data access while ensuring that the data is safeguarded. Successful data sharing also requires strong infrastructure management, and it is crucial to consider the perspectives of all project participants and how they will use the data. Consider, for example, a policy that outlines the steps for data classification. The policy defines what types of data are sensitive or confidential and how they should be handled. It also specifies who is responsible for managing and accessing different types of data and what security measures need to be in place to protect them.
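A data classification policy of this kind can be written down in a form that code can check. The following is a minimal sketch only: the classification levels, column names, roles, and handling rules are all hypothetical, and a real policy would be defined by the organization's own governance team.

```python
# Hypothetical classification levels and their handling rules.
POLICY = {
    "public": {
        "may_share_externally": True,
        "requires_encryption": False,
        "responsible_role": "project_manager",
    },
    "internal": {
        "may_share_externally": False,
        "requires_encryption": False,
        "responsible_role": "data_engineer",
    },
    "confidential": {
        "may_share_externally": False,
        "requires_encryption": True,
        "responsible_role": "data_privacy_specialist",
    },
}

# Hypothetical mapping of dataset columns to classification levels.
COLUMN_LEVELS = {
    "store_region": "public",
    "weekly_sales": "internal",
    "customer_email": "confidential",
}


def handling_rules(column):
    """Return the rules for a column; unclassified columns default to confidential."""
    level = COLUMN_LEVELS.get(column, "confidential")
    return {"level": level, **POLICY[level]}


def shareable_columns(columns):
    """Keep only the columns that the policy allows to be shared externally."""
    return [c for c in columns if handling_rules(c)["may_share_externally"]]


print(shareable_columns(["store_region", "weekly_sales", "customer_email"]))
print(handling_rules("loyalty_card_number"))
```

Treating any unclassified column as confidential is a deliberately cautious design choice in this sketch: data that nobody has reviewed is not shared by default.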

What and How Much Should Be Shared?

The scope of the data science project, along with associated requirements, will determine the type and extent of data that can be shared. This data may encompass quantitative, qualitative, tabular, text-based, image-based, and audio-based formats, among others, each of which may require unique methods of sharing. For example, tools like Tableau or Power BI can be used for sharing tabular data and visualizations, while GitHub has traditionally been useful for sharing programming code, datasets, and related materials (though today, GitHub is used to share nearly any type of file). As introduced in Python Basics for Data Science and discussed throughout this book, Google Colab, which is connected to Google Drive, allows users to collaborate remotely on coding tasks. Files may be hosted on Google Drive or Microsoft OneDrive, allowing lead data scientists to set different levels of permissions relevant to each team member or outside entity.

Those directly involved in the project, including team members, customers, researchers, and the project’s eventual audience, should be granted access to the data in accordance with the intended purpose. If appropriate, and with no privacy-related implications, the data may be disseminated to the broader public through platforms like Kaggle, which is designed for open data sharing in a research context. To protect data privacy and integrity when sharing information with outside parties, ethical policies should be implemented, and appropriate security measures should be put in place. Clear guidelines and procedures must also be provided to ensure that the data is used correctly and only for its intended purpose.

Who Should Have Access to Data?

Access to data in a data science project should be given to individuals who are directly involved in the project and have a legitimate need for the data. This may include the following:

Data scientists and analysts. These are the core team members who are responsible for analyzing and interpreting the data to draw insights and make decisions.
Data engineers. These individuals are responsible for managing and maintaining the data infrastructure and ensuring that the data is accessible and stored securely.
Project managers. They need access to the data to monitor the progress of the project, make informed decisions, and allocate resources effectively.
Data owners. These are the individuals or teams responsible for collecting, processing, and storing the data. They have a deep understanding of the data and can provide valuable insights.
Interested parties. These are individuals who have a business need for the data and will use the insights to make decisions and drive business growth.
Legal and compliance teams. They ensure that the data is accessed and used in compliance with laws, regulations, and company policies.
Data privacy specialists. They ensure that sensitive data is appropriately handled and protected to maintain the privacy of individuals.
External partners or clients. In some cases, access to the data may be necessary for external partners or clients who are involved in the project or have a business need for the data.

It is important to have measures in place to control and monitor data access, such as role-based access controls, to ensure that only authorized individuals have access to the data and that it is used appropriately. Data usage agreements and confidentiality agreements may also be necessary to protect the data and the interests of all involved parties.
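To illustrate how role-based access control and access monitoring fit together, here is a minimal sketch in Python. The role names, actions, and user names are hypothetical, and real projects would normally rely on the access controls built into their database or cloud platform instead of hand-written checks.

```python
# Hypothetical roles mapped to the actions each role may perform.
ROLE_PERMISSIONS = {
    "data_scientist": {"read"},
    "data_engineer": {"read", "write"},
    "project_manager": {"read_summary"},
    "external_partner": {"read_summary"},
}

# Every request is recorded so that access can be monitored and audited later.
access_log = []


def is_allowed(role, action):
    """Return True only if the role is known and the action is permitted for it."""
    return action in ROLE_PERMISSIONS.get(role, set())


def request_access(user, role, action):
    """Check a request against the role's permissions and log the outcome."""
    allowed = is_allowed(role, action)
    access_log.append(
        {"user": user, "role": role, "action": action, "allowed": allowed}
    )
    return allowed


print(request_access("ana", "data_engineer", "write"))
print(request_access("sam", "external_partner", "read"))
print(access_log)
```

Unknown roles receive no permissions, and denied requests are logged alongside granted ones, so the log supports both control and monitoring.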

Example 8.5

Problem

John is working as a data analyst for a large retail company. He has been assigned to a data science project to improve forecasting and increase sales. His team has been collecting various types of data for the project, including sales data, customer demographics, and market trends. As they delve further into the project, they realize that sharing the data with other departments and collaborating with external organizations could greatly benefit their research efforts. However, before sharing the data, they need to consider certain factors. John and his team discuss the different ways they can share the data. What is the most important factor that needs to be in place for secure data sharing? What tool(s) would be most effective in sharing the data in this situation?

Solution

In their discussions, data governance protocols stand out as the most important factor in guaranteeing that the data is shared and used responsibly and ethically. This includes setting guidelines and procedures for data usage, implementing security measures to protect data privacy and integrity, and providing access only to authorized individuals or organizations. For example, a data governance policy might dictate that all personally identifiable information (PII) be anonymized or excluded when sharing with entities outside of the company. Since the data being shared is a mix of qualitative and quantitative data, an online file sharing app such as Google Drive or Microsoft OneDrive is most appropriate. Permissions should be granted to various team members based on their need to know.
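As a rough illustration of the policy described in the solution, the sketch below removes or masks PII fields before a dataset leaves the company. The field names, sample records, and salt value are hypothetical. Note that replacing an identifier with a salted hash is pseudonymization, which is weaker than full anonymization, since the remaining fields may still allow someone to re-identify a person.

```python
import hashlib

# Hypothetical field names; a real policy would list the PII fields explicitly.
PII_DROP = {"name", "email"}        # removed entirely before sharing
PII_PSEUDONYMIZE = {"customer_id"}  # replaced with a salted hash
SALT = "replace-with-a-secret-value"  # assumption: stored securely, never shared


def pseudonymize(value, salt=SALT):
    """Replace an identifier with a shortened, salted SHA-256 digest."""
    digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
    return digest[:12]


def prepare_for_sharing(records):
    """Return copies of the records with PII dropped or pseudonymized."""
    shared = []
    for record in records:
        cleaned = {}
        for field, value in record.items():
            if field in PII_DROP:
                continue
            if field in PII_PSEUDONYMIZE:
                cleaned[field] = pseudonymize(value)
            else:
                cleaned[field] = value
        shared.append(cleaned)
    return shared


# Made-up sample records for illustration only.
records = [
    {"customer_id": 101, "name": "A. Example", "email": "a@example.com",
     "age_group": "25-34", "total_spent": 120.50},
    {"customer_id": 102, "name": "B. Example", "email": "b@example.com",
     "age_group": "35-44", "total_spent": 89.99},
]

shared = prepare_for_sharing(records)
print(shared)
```

Because the same customer ID always maps to the same pseudonym, external partners can still group purchases by customer without seeing who the customer is.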