Dr. Shaun V. Ault; Dr. Soohyun Nam Liao; Larry Musolino

A person fills out paperwork at a table under a canopy with “Census 2010” signs, while a diverse team assists.

Figure 2.1 Periodic population surveys such as censuses help governments plan resources to supply public services. (credit: modification of work “Census 2010 @ La Fuente” by Jenn Turner/Flickr, CC BY 2.0)

Chapter Outline

2.1 Overview of Data Collection Methods

2.2 Survey Design and Implementation

2.3 Web Scraping and Social Media Data Collection

2.4 Data Cleaning and Preprocessing

2.5 Handling Large Datasets

Introduction

Data collection and preparation are the first steps in the data science cycle. They involve systematically gathering the necessary data to meet a project's objectives and ensuring its readiness for further analysis. Well-executed data collection and preparation serve as a solid foundation for effective, data-driven decision-making and aid in detecting patterns, trends, and insights that can drive business growth and efficiency.

With today’s ever-increasing volume of data, a robust approach to data collection is crucial for ensuring accurate and meaningful results. This process requires a comprehensive and systematic methodology designed to ensure the quality, reliability, and validity of the data gathered for analysis. It involves identifying and obtaining relevant data from diverse sources, including internal databases, external repositories, websites, and user-generated information. It also requires meticulous planning and execution so that the collected data is accurate and complete.

Adequately preparing, or “wrangling,” the collected data prior to analysis is equally important. Preparation involves scrubbing, organizing, and transforming the data into a format suitable for analysis. It plays a pivotal role in detecting and resolving inconsistencies or errors in the data, thereby enabling accurate analysis. Rapidly advancing technology and the widespread use of the internet have added complexity to the data collection and preparation processes. As a result, data analysts and organizations face many challenges, such as identifying relevant data sources, managing large data volumes, identifying outliers or erroneous data, and handling unstructured data. By mastering both data collection and data preparation, organizations can extract valuable insights that drive informed decision-making and business success.