Introduction to Python Programming

15.1 Introduction to data science

Learning objectives

By the end of this section you should be able to

  • Describe data science.
  • Identify different stages of the data science life cycle.
  • Name data science tools and software.
  • Use Google Colaboratory to run code.

Data science life cycle

Data science is a multidisciplinary field that combines collecting, processing, and analyzing large volumes of data to extract insights and drive informed decision-making. Data science is increasingly being adopted in many fields, such as healthcare, economics, education, and the social sciences.

The data science life cycle is the framework that data scientists follow to complete a data science project. It is an iterative process that starts with data acquisition, followed by data exploration and then data analysis. The data acquisition stage may involve obtaining data from an existing source or collecting data through surveys and other domain-specific means of data collection. During the data exploration stage, data scientists clean up the data so that the data are in the right format for analysis, and they may also visualize the data for further inspection. Once the data are cleaned, data scientists perform data analysis, which involves using the data to generate insights or build a predictive model. The results of the analysis are then shared with stakeholders through reports and presentations. The animation below demonstrates the different stages of the data science life cycle.


Data science life cycle
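
The stages described above can be sketched in a few lines of Python. This is only an illustration, not a standard workflow: the dataset, the column names, and the function names `acquire`, `explore`, `analyze`, and `report` are all hypothetical, and treating reporting as a separate step is an assumption made for clarity. The sketch uses only the Python standard library.

```python
import csv
import io
import statistics

# Hypothetical raw data standing in for an acquired dataset (not from the text).
RAW_CSV = """student,hours_studied,exam_score
Ana,2,65
Ben,4,78
Cara,,82
Dev,6,90
Eli,8,not recorded
"""

def acquire(text):
    """Data acquisition: read rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))

def explore(rows):
    """Data exploration: keep only rows whose fields are valid numbers."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append((float(row["hours_studied"]), float(row["exam_score"])))
        except ValueError:
            continue  # skip rows with missing or non-numeric values
    return cleaned

def analyze(cleaned):
    """Data analysis: summarize the cleaned exam scores."""
    scores = [score for _, score in cleaned]
    return {"rows": len(scores), "mean_score": statistics.mean(scores)}

def report(summary):
    """Reporting: turn the results into a sentence for stakeholders."""
    return f"{summary['rows']} valid rows; mean exam score {summary['mean_score']:.1f}"

rows = acquire(RAW_CSV)
cleaned = explore(rows)
summary = analyze(cleaned)
print(report(summary))  # prints "3 valid rows; mean exam score 77.7"
```

Because the life cycle is iterative, a real project would likely loop back, for example by acquiring more data after the exploration step reveals how many rows had to be discarded.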

Concepts in Practice

What is data science?

What is the first stage of any data science life cycle?
  1. data visualization
  2. data cleanup
  3. data acquisition
How many stages does the data science life cycle have?
  1. 3
  2. 4
  3. 5
What does a data scientist do in the data exploration stage?
  1. document insights and visualizations
  2. analyze data
  3. clean and visualize data

Data science tools

Several software tools are commonly used in data science. Here are some examples.

  • Python programming language: Python is widely used in data science. It has a large ecosystem of libraries designed for data analysis, machine learning, and visualization. Some popular Python libraries for data science include NumPy, pandas, Matplotlib, Seaborn, and scikit-learn. In this chapter, you will explore some of these libraries.
  • R programming language: R is commonly used in statistical computing and data analysis, and it offers a wide range of packages and libraries tailored for data manipulation, statistical modeling, and visualization.
  • Jupyter Notebook/JupyterLab: Jupyter Notebook and JupyterLab are web-based interactive computing environments that support multiple programming languages, including Python and R. They allow a programmer to create documents that contain code, visualizations, and text, making them suitable for data exploration, analysis, and reporting.
  • Google Colaboratory: Google Colaboratory is a cloud-based Jupyter Notebook environment that allows a programmer to write, run, and share Python code online. In this chapter, you will use Google Colaboratory to practice data science concepts.
  • Kaggle Kernels: Kaggle Kernels is an online data science platform that provides a collaborative environment for building and running code. Kaggle Kernels supports Python and R and offers access to datasets, pre-installed libraries, and computational resources. Kaggle also hosts data science competitions and provides a platform for sharing and discovering data science projects.
  • Excel/Sheets: Microsoft Excel and Google Sheets are widely used spreadsheet applications that offer basic data analysis and visualization capabilities. They can help beginners get started with data manipulation, basic statistical calculations, and simple visualizations.
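
As a small taste of what a data science library adds to plain Python, the sketch below contrasts a built-in list with a NumPy array. The sample values are hypothetical, and the example assumes NumPy is installed, which is typically the case in Google Colaboratory.

```python
import numpy as np

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample values

# Multiplying a built-in list repeats it rather than scaling its elements.
repeated = scores * 2

# A NumPy array applies arithmetic to every element at once.
scores_array = np.array(scores)
doubled = scores_array * 2

mean_score = scores_array.mean()
std_score = scores_array.std()  # population standard deviation

print(len(repeated))          # 16 items: the list was repeated
print(doubled.tolist())       # [4, 8, 8, 8, 10, 10, 14, 18]
print(mean_score, std_score)  # 5.0 2.0
```

Element-wise arithmetic and built-in statistics such as `mean()` and `std()` are part of the reason NumPy arrays are commonly preferred over lists for numerical data.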


Google Colaboratory ecosystem

Concepts in Practice

Data science tools and software

Among Python, R, and Java, which is the most popular language in data science?
  1. Python
  2. R
  3. Java
Which of the following is a data science-related library in Python?
  1. list
  2. NumPy
  3. array
Google Colaboratory can be used for reporting and sharing insights.
  1. true
  2. false

Programming practice with Google

Open the Google Colaboratory document below. To open the document, you need to log in to a Google account or create one if you do not already have one. Run all cells. You may also try creating new cells or modifying existing cells. To save a copy of your edits, go to "File > Save a Copy in Drive", and the edited file will be stored in your own Google Drive.

Google Colaboratory document
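
If you create a new code cell, a minimal first cell might look like the sketch below. The temperature readings and variable names are hypothetical and are not taken from the Colaboratory document. A cell can be run by selecting the run button next to it or by pressing Shift+Enter.

```python
import sys

# Report which Python version the notebook runtime is using.
print("Python version:", sys.version.split()[0])

# A small computation to confirm that the cell ran.
temperatures_c = [18.5, 21.0, 19.5, 23.0]  # hypothetical readings
average_c = sum(temperatures_c) / len(temperatures_c)
print("Average temperature:", average_c)  # prints "Average temperature: 20.5"
```

The output appears directly below the cell, and variables such as `average_c` remain available to cells that are run afterward.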


This book may not be used in the training of large language models or otherwise be ingested into large language models or generative AI offerings without OpenStax's permission.

Want to cite, share, or modify this book? This book uses the Creative Commons Attribution License and you must attribute OpenStax.


© Mar 15, 2024 OpenStax. Textbook content produced by OpenStax is licensed under a Creative Commons Attribution License. The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, and OpenStax CNX logo are not subject to the Creative Commons license and may not be reproduced without the prior and express written consent of Rice University.

This book utilizes the OpenStax Python Code Runner. The code runner is developed by Wiley and is All Rights Reserved.