Project A: Data Source Quality
As a student of, or a new professional working in, data science, you will not always be collecting new primary data. It’s just as important to be able to locate, critically evaluate, and properly clean existing sources of secondary data. (Collecting and Preparing Data will cover the topic of data collection and cleaning in more detail.)
Some reputable government data sources are:
- Data.gov
- Bureau of Labor Statistics (BLS)
- National Oceanic and Atmospheric Administration (NOAA)
Some reputable nongovernment data sources are:
- Kaggle
- Statista
- Pew Research Center
Using the suggested sources or sources of similar quality that you find on the Internet, locate two to three datasets about the field or industry in which you intend to work. (You might also try to determine whether similar datasets are available at the national, state/province, and local/city levels.) In a group, formulate a specific, typical policy issue or business decision that managers in that field or industry might face. For the datasets you found, compare and contrast their size, collection methods, types of data, update frequency and recency, and relevance to the decision question you have identified.
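Part of this comparison can be tabulated with Pandas once the datasets are loaded. The following is a minimal sketch, not a prescribed method: the helper `summarize`, the column names, and the two tiny DataFrames are all hypothetical, made-up stand-ins for the datasets your group finds. It covers only size, data types, missing values, and recency; collection methods, update frequency, and relevance to your decision question cannot be computed and have to be judged from each source's documentation.

```python
import pandas as pd


def summarize(name, df, date_col):
    """Return a one-row summary of a dataset's size, types, gaps, and recency.

    Hypothetical helper for illustration; `date_col` names the column
    that holds each record's date.
    """
    dates = pd.to_datetime(df[date_col])
    return {
        "dataset": name,
        "rows": len(df),
        "columns": df.shape[1],
        "numeric_columns": df.select_dtypes(include="number").shape[1],
        # Average fraction of missing values across all columns.
        "missing_share": float(df.isna().mean().mean()),
        "latest_record": dates.max().date(),
    }


# Made-up stand-ins; in practice these would come from pd.read_csv(...).
national = pd.DataFrame({
    "date": ["2023-01-01", "2023-02-01", "2023-03-01"],
    "value": [10.0, None, 12.5],
})
city = pd.DataFrame({
    "date": ["2021-06-01", "2022-06-01"],
    "value": [3.2, 3.9],
    "district": ["A", "B"],
})

# One row per dataset, so the summaries can be read side by side.
comparison = pd.DataFrame([
    summarize("national", national, "date"),
    summarize("city", city, "date"),
]).set_index("dataset")
print(comparison)
```

A table like this is only a starting point for the group discussion: a larger or more recent dataset is not necessarily the more relevant one for the decision at hand.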
Project B: Data Visualization
Using one of the data sources mentioned in the previous project, find a dataset that interests you. Download it as a CSV file. Use Python to read in the CSV file as a Pandas DataFrame. As a group, think of a specific question that might be addressed using this dataset, discuss which features of the data seem most important to answer your question, and then use the Python libraries Pandas and Matplotlib to select the features and make graphs that might help to answer your question about the data. Note that you will learn many sophisticated techniques for doing data analysis in later chapters, but for this project, you should stick to simply isolating some data and visualizing it using the tools presented in this chapter. Write a brief report on your findings.
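The workflow described in this project (read a CSV file, isolate the relevant features, and graph them) might look like the sketch below. It is one possible approach under stated assumptions: the CSV text, the column names `year`, `region`, and `unemployment_rate`, and all of the values are made up for illustration and stand in for whatever dataset your group downloads.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for a downloaded file. With a real dataset, pass the
# file's path to pd.read_csv instead of an in-memory string.
csv_text = """year,region,unemployment_rate
2020,North,6.1
2021,North,5.2
2022,North,4.0
2020,South,7.3
2021,South,6.0
2022,South,4.8
"""
df = pd.read_csv(io.StringIO(csv_text))

# Isolate the rows and columns that bear on the question being asked.
north = df.loc[df["region"] == "North", ["year", "unemployment_rate"]]

# Draw a simple line graph of the selected feature over time.
fig, ax = plt.subplots()
ax.plot(north["year"], north["unemployment_rate"], marker="o")
ax.set_xlabel("Year")
ax.set_ylabel("Unemployment rate (%)")
ax.set_title("Unemployment rate, North region (made-up data)")
fig.savefig("north_unemployment.png")
```

In an interactive session or notebook, the `matplotlib.use("Agg")` line can be dropped and `plt.show()` used in place of `fig.savefig(...)` to display the graph directly.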
Project C: Privacy, Ethics, and Bias
For each of the following themes, identify at least one example from recent news articles or current events (starting references are given in parentheses):
- Privacy concerns related to data collection (See the Protecting Personal Privacy website of the U.S. Government Accountability Office.)
- Ethics concerns related to data collection, including fair use of copyrighted materials (See the U.S. Copyright Office guidelines.)
- Bias concerns related to data collection (See the National Cancer Institute (NCI) article on data bias.)
Suppose that you are part of a data science team working for an organization on data collection for a major project or product. Discuss as a team how the issues of privacy, ethics, and equity (avoiding bias) could be addressed, depending on your position in the organization and the type of project or product.