Udayan Das; Aubrey Lawson; Chris Mayfield; Narges Norouzi

Learning objectives

By the end of this section, you should be able to:

Describe data science.
Identify different stages of the data science life cycle.
Name data science tools and software.
Use Google Colaboratory to run code.

Data science life cycle

Data science is a multidisciplinary field that combines collecting, processing, and analyzing large volumes of data to extract insights and drive informed decision-making. The data science life cycle is the framework that data scientists follow to complete a data science project. It is an iterative process that starts with data acquisition, followed by data exploration. The data acquisition stage may involve obtaining data from an existing source or collecting data through surveys and other domain-specific means. During the data exploration stage, data scientists clean up the data so that they are in the right format for the data analysis stage, and they may also visualize the data for further inspection. Once the data are cleaned, data scientists can perform data analysis, which involves using the data to generate insights or build a predictive model. The results of the analysis are then shared with stakeholders through reports and presentations. Data science is increasingly being adopted in many fields, such as healthcare, economics, education, and the social sciences. The animation below demonstrates the different stages of the data science life cycle.
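The stages described above can be sketched in a few lines of Python. This is only a minimal illustration that uses the standard library: the survey data, the column names, and the function names (acquire, explore, analyze, report) are all hypothetical and chosen for this example, and a real project would typically read data from a file or service and use dedicated data science libraries.

```python
import csv
import io
import statistics

# Hypothetical survey data, embedded as text so the example is self-contained.
RAW_CSV = """name,hours_studied,score
Ana,5,82
Ben,,74
Cy,3,68
Dee,8,95
"""

def acquire(text):
    """Data acquisition: read rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))

def explore(rows):
    """Data exploration: drop rows with missing values and convert types."""
    cleaned = []
    for row in rows:
        if all(value.strip() for value in row.values()):
            cleaned.append({
                "name": row["name"],
                "hours_studied": float(row["hours_studied"]),
                "score": float(row["score"]),
            })
    return cleaned

def analyze(rows):
    """Data analysis: compute simple summary statistics."""
    scores = [row["score"] for row in rows]
    return {"count": len(scores), "mean_score": statistics.mean(scores)}

def report(summary):
    """Sharing results: format the insight for stakeholders."""
    return (f"{summary['count']} complete responses, "
            f"mean score {summary['mean_score']:.1f}")

rows = acquire(RAW_CSV)        # 4 rows, one with a missing value
cleaned = explore(rows)        # 3 rows remain after cleanup
summary = analyze(cleaned)
print(report(summary))         # prints: 3 complete responses, mean score 81.7
```

Because the life cycle is iterative, a data scientist might return to the acquisition or exploration step after seeing the results, for example to collect the missing value instead of dropping the row.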

Checkpoint

Data science life cycle


Concepts in Practice

What is data science?

1.

What is the first stage of any data science life cycle?

data visualization
data cleanup
data acquisition

2.

How many stages does the data science life cycle have?

3
4
5

3.

What does a data scientist do in the data exploration stage?

document insights and visualization
analyze data
data cleaning and visualization

Data science tools

Several tools and software are commonly used in data science. Here are some examples.

Python programming language: Python is widely used in data science. It has a large ecosystem of libraries designed for data analysis, machine learning, and visualization. Some popular Python libraries for data science include NumPy, pandas, Matplotlib, Seaborn, and scikit-learn. In this chapter, you will explore some of these libraries.
R programming language: R is commonly used in statistical computing and data analysis, and it offers a wide range of packages and libraries tailored for data manipulation, statistical modeling, and visualization.
Jupyter Notebook/JupyterLab: Jupyter Notebook and JupyterLab are web-based interactive computing environments that support multiple programming languages, including Python and R. They allow a programmer to create documents that contain code, visualizations, and text, making them suitable for data exploration, analysis, and reporting.
Google Colaboratory: Google Colaboratory is a cloud-based Jupyter Notebook environment that allows a programmer to write, run, and share Python code online. In this chapter, you will use Google Colaboratory to practice data science concepts.
Kaggle Kernels: Kaggle Kernels is an online data science platform that provides a collaborative environment for building and running code. Kaggle Kernels supports Python and R and offers access to datasets, pre-installed libraries, and computational resources. Kaggle also hosts data science competitions and provides a platform for sharing and discovering data science projects.
Excel/Sheets: Microsoft Excel and Google Sheets are widely used spreadsheet applications that offer basic data analysis and visualization capabilities. They can help beginners get started with data manipulation, basic statistical calculations, and simple visualizations.
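As a quick way to see which of the Python libraries named above are available in a given environment, the following sketch checks each one without importing it. The helper name check_libraries is hypothetical and introduced only for this example; the check itself relies on importlib.util.find_spec from the standard library. Note that scikit-learn is imported under the module name sklearn.

```python
import importlib.util

# Import names of the libraries mentioned above.
# scikit-learn is imported as "sklearn".
LIBRARIES = {
    "NumPy": "numpy",
    "pandas": "pandas",
    "Matplotlib": "matplotlib",
    "Seaborn": "seaborn",
    "scikit-learn": "sklearn",
}

def check_libraries(libraries):
    """Return a dict mapping each library name to True if it can be found."""
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in libraries.items()
    }

availability = check_libraries(LIBRARIES)
for name, installed in availability.items():
    status = "available" if installed else "not installed"
    print(f"{name}: {status}")
```

The output depends on the environment in which the code runs, so no particular result is shown here.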

Checkpoint

Google Colaboratory ecosystem


Concepts in Practice

Data science tools and software

4.

Among Python, R, and Java, which is the most popular language in data science?

Python
R
Java

5.

Which of the following is a data science-related library in Python?

list
NumPy
array

6.

Google Colaboratory can be used for reporting and sharing insights.

true
false

Programming practice with Google

Open the Google Colaboratory document below. To open the document, you need to log in to a Google account or create one if you do not already have one. Run all cells. You may also try creating new cells or modifying existing cells. To save a copy of your edits, go to "File > Save a Copy in Drive", and the edited file will be stored in your own Google Drive.

Google Colaboratory document
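If you would like something small to try in a new cell, the sketch below is one possibility. It is not part of the Colaboratory document above. The temperature values are made up for illustration, and the code uses only the Python standard library, so it should run in any notebook cell.

```python
import statistics
import sys

# Show which Python version the notebook is running.
print("Python version:", sys.version.split()[0])

# Hypothetical temperature readings, used only for practice.
temperatures = [21.0, 23.0, 20.0, 22.0]

# Compute and display the average reading.
average = statistics.mean(temperatures)
print(f"Average temperature: {average:.1f}")  # prints: Average temperature: 21.5
```

Try changing the values in the list and running the cell again to see how the average changes.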

15.1 Introduction to data science