Learning Outcomes
By the end of this section, you should be able to:
- 8.3.1 Recognize the importance of visualizing data in a way that accurately reflects the underlying information.
- 8.3.2 Define data source attribution and its significance in data science and research.
- 8.3.3 Identify barriers to accessibility and inclusivity and apply universal design principles.
After data is collected and stored securely, it should be analyzed to extract insights using authorized tools and approved procedures, including data validation techniques. The insights are then visualized using charts, tables, and graphs, which are designed to communicate the findings clearly and concisely. The final report and conclusion should be prepared, ensuring that the visualizations used adhere to ethical guidelines such as avoiding misleading information or making unsubstantiated claims. Any assumptions and unavoidable biases should be clearly articulated when interpreting and reporting the data.
It is essential to ensure that data is presented with fairness and in a way that accurately reflects the underlying research and understanding. All data should be accompanied by appropriate and factual documentation. Additionally, both data and results should be safeguarded to prevent misinterpretation and to avoid manipulation by other parties. Moreover, barriers to accessibility need to be identified and addressed, following guidelines of inclusivity and universal design principles. These ethical principles should be adhered to throughout the data analysis process, particularly when interpreting and reporting findings.
To maintain ethical principles throughout the data analysis process and in reporting, the data scientist should adhere to these practices:
- Exercise objectivity when drafting and presenting any reports or findings.
- Acknowledge all third-party data sources appropriately.
- Ensure that all visualizations are unambiguous and do not focus on any sensationalized data points.
- Construct visualizations in a meaningful way, utilizing appropriate titles, labels, scales, and legends (see the sketch following this list).
- Present all data in a complete picture, avoiding masking or omitting portions of graphs.
- Ensure that the scales on all axes in a chart are consistent and proportionate.
- Exercise caution when implying causality between connected data points, providing supporting evidence if needed.
- Utilize representative datasets of the population of interest.
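Several of these practices can be demonstrated in a single chart. The following is a minimal sketch, assuming Python with matplotlib; the months, sales figures, and product names are hypothetical. It applies a descriptive title, labeled axes with units, a zero-based proportionate scale, and a legend:

```python
import matplotlib.pyplot as plt

# Hypothetical monthly sales for two product lines
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
product_a = [42, 45, 47, 44, 50, 53]
product_b = [30, 29, 33, 35, 34, 38]

fig, ax = plt.subplots()
ax.plot(months, product_a, marker="o", label="Product A")
ax.plot(months, product_b, marker="s", label="Product B")

ax.set_title("Monthly Sales, January-June (Hypothetical Data)")  # descriptive title
ax.set_xlabel("Month")                   # labeled horizontal axis
ax.set_ylabel("Units sold (thousands)")  # labeled vertical axis with units
ax.set_ylim(0, 60)                       # consistent, zero-based scale
ax.legend(title="Product line")          # legend identifying each series

plt.show()
```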
Accurate Representation
Accurate representation is a crucial aspect of data science and reporting, and it refers to presenting data in a way that authentically reflects the underlying information. This includes ensuring that the data is not misrepresented or manipulated in any way and that the findings are based on reliable and valid data. From a data scientist's perspective, accurate representation involves this sequence of steps:
- Understanding the data. Before visualizing or reporting on data, a data scientist must have a thorough understanding of the data, including its sources, limitations, and potential biases.
- Choosing appropriate visualizations. Data can be presented in a variety of ways, such as graphs, charts, tables, or maps. A data scientist must select the most suitable visualization method that accurately represents the data and effectively communicates the findings.
- Avoiding bias and manipulation. Data scientists must avoid manipulating or cherry-picking data to support a specific narrative or agenda. This can lead to biased results and misinterpretations, which can have serious consequences.
- Fact-checking and verifying data. Data scientists must ensure the accuracy and validity of the data they are working with. This involves cross-checking data from multiple sources and verifying its authenticity.
- Proper data attribution. Giving credit to the sources of data and properly citing them is an important aspect of accurate representation. This allows for transparency and accountability in data reporting.
When reporting results, data scientists must be transparent about the data they have collected and how it was utilized. Did they obtain informed consent from individuals before collecting their data? What measures were taken to ensure the quality and reliability of the data? Moreover, as data and algorithms become more complex, it can be challenging to understand the reasoning behind the results. A lack of transparency may lead to skepticism about the results and the decision-making process. Data analysts and modelers must present their methods and results clearly and legibly.
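As one illustration of the fact-checking step above, a data scientist might programmatically compare a figure reported by two independent sources before using it. The following is a minimal sketch, assuming Python with pandas; the region names, revenue values, and column names are hypothetical:

```python
import pandas as pd

# Hypothetical extracts of the same metric from two independent sources
source_a = pd.DataFrame({"region": ["North", "South", "East"],
                         "revenue": [1200, 950, 1100]})
source_b = pd.DataFrame({"region": ["North", "South", "East"],
                         "revenue": [1200, 990, 1100]})

# Join on the shared key and flag disagreements for manual verification
merged = source_a.merge(source_b, on="region", suffixes=("_a", "_b"))
merged["mismatch"] = merged["revenue_a"] != merged["revenue_b"]

# Rows flagged here should be verified against the original sources
print(merged[merged["mismatch"]])
```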
Example 8.8
Problem
A team of data scientists is tasked with conducting a comprehensive analysis of a company's sales performance. To provide an accurate assessment of the trends and patterns, the team collects data from a variety of sources. The first data scientist presents a graph illustrating the company's sales trend over the course of the past year, including all identified outliers as shown in Figure 8.5. However, a second data scientist chooses to hide certain data points with lower sales figures from the graph as shown in Figure 8.6. Select and discuss the more appropriate representation of the data. Which data scientist is showing a more precise and representative visualization of the company's sales performance?
- A. The second data scientist chooses to remove the low sales outliers from the graph. This approach can be useful in certain cases, such as when the data is extremely skewed and the outliers are significantly impacting the overall trend of the data. The first data scientist chooses to include all data points, including the outliers, in their graph. This means that the graph of the first data scientist includes unusual or extreme values that may lead to misleading insights and decisions.
- B. The more accurate presentation would be the first one, which includes all the data points without any exclusion. By excluding the low sales points, the second data scientist is not presenting the complete picture, which can lead to misleading insights and to decisions based on incomplete information. Additionally, the low sales points may be genuine measurements rather than random outliers.
- C. In terms of accuracy, it is difficult to say which presentation is more accurate. In general, it is important to carefully consider the impact of outliers before deciding whether or not to include them in the data analysis. However, if there are clear errors or irregularities in the data, it may be appropriate to remove them to prevent them from skewing the results. Ultimately, both approaches have their merits and limitations.
- D. The second data scientist chooses to remove the outliers from the data and only includes the data points that fall within a certain range. This results in a cleaner and more consistent trend in the graph, as the outliers are not shown to skew the data. However, removing the outliers may cause important information about the sales performance to not be accurately represented.
Solution
The correct answer is B. The presentation delivered by the first data scientist is deemed more precise as it offers a holistic and unbiased depiction of the sales trend. This enables a comprehensive understanding of the sales performance and any discernible patterns. On the contrary, the second data scientist's presentation may give the illusion of a positive trend in sales, but it fails to accurately portray the complete dataset. Omitting the lower sales points can distort the data and potentially lead to erroneous conclusions. Moreover, the act of disregarding data points goes against the principle of transparency in data analysis. Data scientists must present the entire dataset and accurately exhibit existing trends and patterns rather than manipulating the data to support a desired narrative. Additionally, the removal of data points without disclosure in the graph violates ethical principles in data analysis.
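When unusual points truly require separate treatment, the transparent alternative is to display and flag them rather than silently delete them. The following is a minimal sketch, assuming Python with matplotlib; the sales series and the outlier threshold are hypothetical and would need justification in a real analysis:

```python
import matplotlib.pyplot as plt

# Hypothetical monthly sales containing two unusually low months
months = list(range(1, 13))
sales = [88, 91, 90, 35, 94, 96, 93, 41, 97, 99, 102, 100]

# Flag, rather than remove, points below a disclosed threshold
threshold = 60
flagged = [(m, s) for m, s in zip(months, sales) if s < threshold]

fig, ax = plt.subplots()
ax.plot(months, sales, marker="o", label="All observations")
if flagged:
    xs, ys = zip(*flagged)
    ax.scatter(xs, ys, color="red", zorder=3,
               label=f"Below {threshold} (flagged, not removed)")

ax.set_title("Monthly Sales with Outliers Disclosed (Hypothetical Data)")
ax.set_xlabel("Month")
ax.set_ylabel("Sales (units)")
ax.legend()
plt.show()
```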
Example 8.9
Problem
Two expert data analysts have constructed bar graphs shown in Figure 8.7 and Figure 8.8 depicting the market share proportions of various soft drink brands presented in Table 8.1. Each bar is clearly labeled with the respective brand name and corresponding percentage, providing a straightforward and accurate representation of the data. Which of the two expert data analysts has accurately depicted the market share proportions of soft drink brands in their bar graph? Explain the difference between the two figures and potential consequences of each presentation.
| Soft Drink | Share of Volume (%) |
|---|---|
| Coca-Cola | 18.6 |
| Pepsi | 9.1 |
| Diet Coke | 8.0 |
| Dr. Pepper | 7.5 |
| Sprite | 7.2 |
| Mountain Dew | 7.0 |
| Diet Pepsi | 3.5 |
| Coke Zero | 2.8 |
| Fanta | 2.4 |
| Diet Mountain Dew | 2.1 |
Solution
The first data scientist, who generated Graph A in Figure 8.7, has effectively represented the brand names and their corresponding market share percentages, using correct labeling and scaling of axes. This facilitates easy interpretation and comparison of the data. The second data scientist, who produced Graph B as shown in Figure 8.8, scaled the vertical axis from 0 to 100, making it more difficult to observe small differences in percentages. Moreover, Graph B fails to label the horizontal or vertical axes in equivalent detail, which may cause undue confusion, even if the intended meaning is clear from context.
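The effect described in this solution can be reproduced directly from Table 8.1. The following is a minimal sketch, assuming Python with matplotlib, that draws the same shares twice, once with the vertical axis fitted to the data and once stretched to 100, so the compression seen in Graph B is visible side by side:

```python
import matplotlib.pyplot as plt

# Market share data from Table 8.1
brands = ["Coca-Cola", "Pepsi", "Diet Coke", "Dr. Pepper", "Sprite",
          "Mountain Dew", "Diet Pepsi", "Coke Zero", "Fanta",
          "Diet Mountain Dew"]
share = [18.6, 9.1, 8.0, 7.5, 7.2, 7.0, 3.5, 2.8, 2.4, 2.1]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

for ax, ymax, note in ((ax1, 20, "axis fitted to the data"),
                       (ax2, 100, "axis stretched to 100")):
    ax.bar(brands, share)
    ax.set_ylim(0, ymax)  # only the vertical scale differs between panels
    ax.set_title(f"Share of Volume ({note})")
    ax.set_ylabel("Share of Volume (%)")
    ax.tick_params(axis="x", labelrotation=90)

fig.tight_layout()
plt.show()
```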
Example 8.10
Problem
A data scientist at a health care research institution is analyzing the effects of a new medication on a specific disease. After collecting data from multiple clinical trials, the scientist summarized the results in Table 8.2 and created a histogram to show the distribution of participants across different age groups as shown in Figure 8.9. The scientist included age intervals of 18–26, 27–36, 37–46, 47–56, 57–66, and 67+ in the chart. To provide a complete understanding of the data, what additional information should be included?
- A. The percentage of participants in each group to accurately depict the proportion of participants in each age group
- B. The total number of participants to demonstrate the scope of the results
- C. The number of participants in each age group for a comprehensive overview of the data
- D. Options A, B, and C should all be included in the chart
| Age | Participants |
|---|---|
| 18–26 | 100 |
| 27–36 | 120 |
| 37–46 | 107 |
| 47–56 | 71 |
| 57–66 | 89 |
| 67+ | 113 |
Solution
The correct answer is D. This is the most appropriate choice because the presentation of results needs to be improved to accurately depict the data. This includes clearly indicating the percentage and total number of participants in each group and the overall number of participants as given in Figure 8.10. See Statistical Inference and Confidence Intervals for a review of the significance and implications of accuracy and confidence in relation to data reliability and decision-making processes.
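One way to incorporate the information from options A, B, and C is to annotate each bar with its count and percentage and state the total in the title. The following is a minimal sketch using the Table 8.2 values, assuming Python with matplotlib:

```python
import matplotlib.pyplot as plt

# Participant counts by age group from Table 8.2
age_groups = ["18-26", "27-36", "37-46", "47-56", "57-66", "67+"]
counts = [100, 120, 107, 71, 89, 113]
total = sum(counts)  # 600 participants overall

fig, ax = plt.subplots()
bars = ax.bar(age_groups, counts)

# Label each bar with its count and its share of all participants
for bar, n in zip(bars, counts):
    ax.annotate(f"{n}\n({n / total:.1%})",
                xy=(bar.get_x() + bar.get_width() / 2, bar.get_height()),
                xytext=(0, 3), textcoords="offset points",
                ha="center", va="bottom")

ax.set_title(f"Participants by Age Group (N = {total})")
ax.set_xlabel("Age group")
ax.set_ylabel("Number of participants")
ax.set_ylim(0, 140)  # headroom for the annotations
plt.show()
```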
Data Source Attribution
Data source attribution is the important practice of clearly identifying and acknowledging the sources employed in the visualizations and reporting of data. It is the data scientist's responsibility to uphold these principles and present data in a manner that is both transparent and ethical. Data source attribution is essential for several reasons:
- Accuracy and credibility. Attribution ensures that the data utilized in the visualizations and reporting is accurate and reliable. By clearly stating the sources, any potential errors or biases in the data can be identified and corrected, increasing the credibility of the information presented.
- Trust and transparency. Disclosing the sources of data promotes trust and transparency between data producers and consumers. This is especially important when data is used to inform decision-making or to shape public perceptions.
- Accountability. Providing full attribution in published results allows the rest of the research community to validate and build on them. This safeguard reinforces data integrity and holds researchers accountable for the claims and conclusions put forth.
- Privacy considerations. In some circumstances, the data employed in visualizations may retain sensitive or personal information. Attributing the source of the data can help protect individuals' privacy and avoid potential harm or misuse of their information.
It is important to note that if the data utilized in graphs or reports are not publicly available, permission must be obtained from the source before the data are used in new research or publications. Proper attribution and citation of the data source are also required.
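In a published figure, attribution can be made visible on the chart itself rather than buried in surrounding text. The following is a minimal sketch, assuming Python with matplotlib; the revenue figures and the source string are hypothetical placeholders:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.bar(["2021", "2022", "2023"], [3.1, 3.4, 3.9])
ax.set_title("Annual Revenue (Hypothetical Data)")
ax.set_ylabel("Revenue ($M)")

# Visible source credit anchored to the figure, outside the plot area
fig.text(0.01, 0.01,
         "Source: Example Industry Survey 2023 (hypothetical), "
         "used with permission.",
         fontsize=8, ha="left", va="bottom")

plt.show()
```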
Data source attribution is also closely tied to the principles of accessibility and inclusivity. Clearly stating the sources of data enables people with different backgrounds and abilities to access and understand the information presented.
Proper attribution of sources also builds trust. Without explicit citations and permissions in place, the data scientist can be accused of plagiarism, whether the omission is intentional or not. Consider the use of AI in education. Some instructors allow their students to use resources such as ChatGPT to assist them in their writing and research. As with textual references, these instructors might require their students to include a statement on the use of AI, including the prompts they used and the responses the AI or chatbot returned. The student may then modify the responses as needed for their project. Other instructors disallow the use of AI altogether and would consider it cheating if a student used this resource on an assignment. Questions of academic integrity abound, and similar ethical issues must also be considered when planning and conducting a data science project.
Accessibility and Inclusivity
One of the ethical responsibilities of data scientists and researchers is to ensure that the data they present is accessible and inclusive to all individuals regardless of their capabilities or experiences. This includes considering the needs of individuals with disabilities, individuals from different cultural or linguistic backgrounds, and individuals with different levels of education or literacy. Indeed, incorporating universal design principles when reporting results will aid not only those with different abilities, but every individual who reads or views the report.
Universal design principles refer to a set of guidelines aimed at creating products, environments, and systems that are accessible and usable by all people regardless of age, ability, or disability. The goal of universal design is to ensure inclusivity and promote equal access, allowing everyone to participate in everyday activities without additional adaptations or accommodations. In data science, this may involve creating visualizations and reports that utilize accessible fonts, colors, and formats, and providing alternative versions for individuals who may have difficulty accessing or understanding the data in its original form. Barriers to accessibility and inclusivity include physical barriers, such as visualizations that are inaccessible to individuals with visual impairments. Linguistic and cultural barriers may also prevent people outside the researchers' cultural group, those with limited literacy, and non-native speakers from fully understanding complex data visualizations and reports. By applying universal design principles, data scientists and researchers can help mitigate these barriers and ensure that the presented data is accessible and inclusive to a wide range of individuals.
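Several universal design practices can be applied directly when building a chart: a palette that remains distinguishable under common forms of color blindness, redundant encoding so the chart does not rely on color alone, and larger text for low-vision readers. The following is a minimal sketch, assuming Python with matplotlib; the groups and scores are hypothetical:

```python
import matplotlib.pyplot as plt

# Hypothetical results for three groups
categories = ["Group 1", "Group 2", "Group 3"]
values = [40, 55, 30]

# Okabe-Ito colors, distinguishable under common forms of color blindness
colors = ["#0072B2", "#E69F00", "#009E73"]
# Redundant encoding: hatching distinguishes bars even without color
hatches = ["/", "\\", "x"]

plt.rcParams.update({"font.size": 14})  # larger text for readability

fig, ax = plt.subplots()
for cat, val, color, hatch in zip(categories, values, colors, hatches):
    ax.bar(cat, val, color=color, hatch=hatch, edgecolor="black", label=cat)

ax.set_title("Results by Group (Hypothetical Data)")
ax.set_ylabel("Score")
ax.legend()
plt.show()

# When publishing, also supply alt text and an accompanying data table
# as alternative formats for screen-reader users.
```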
In addition to differences of ability, there are also significant issues in the ability of some individuals to access and use digital technologies because of socioeconomic differences. The digital divide refers to the gap between those who have access to digital technologies, such as the internet and computers, and those who do not. Factors such as geographic location, age, education level, and income level contribute to the digital divide, which can lead to data inequality, where certain groups are underrepresented or excluded from data-driven decision-making. Addressing the digital divide through investments in infrastructure (e.g., developing reliable internet access in underserved areas), digital literacy education, and inclusive data collection (methods that do not rely solely on participants having access to technology) will ultimately narrow this divide and foster greater social and economic equity.
Finally, data scientists should be cognizant of the role that their field plays in opening up opportunities for historically underrepresented and marginalized communities. In recent years, there has been growing recognition of the need to increase diversity, equity, and inclusion (DEI) in data science. Efforts to address this imbalance include creating mentorship programs, supporting underrepresented groups in STEM education, and promoting equitable hiring practices in tech companies. As the field evolves, there is a concerted effort to welcome more women and people from underrepresented backgrounds into data science roles.
Initiatives aimed at promoting representation in data science careers, addressing bias in data and algorithms, and supporting equitable access to educational resources are increasingly common. By fostering a culture of inclusion, data scientists can not only drive innovation but also ensure that the insights and technologies they develop benefit all segments of society.
Data science is becoming a standard part of the educational landscape, reaching diverse learners and providing pathways to employment in the field. The transition of data science from a specialized, highly technical, and exclusive subject to one that can be approached by a wider audience helps to bridge gaps in educational opportunities and fosters a more inclusive pipeline into data science careers. It also reflects the increasing demand for data science skills in the workforce, encouraging broader adoption in varied educational settings.