Learning objectives
By the end of this section, you should be able to:
- Describe data science.
- Identify different stages of the data science life cycle.
- Name data science tools and software.
- Use Google Colaboratory to run code.
Data science life cycle
Data science is a multidisciplinary field that combines collecting, processing, and analyzing large volumes of data to extract insights and drive informed decision-making. The data science life cycle is the framework that data scientists follow to complete a data science project. It is an iterative process that starts with data acquisition, continues with data exploration and data analysis, and ends with sharing the results.
The data acquisition stage may involve obtaining data from an existing source or collecting data through surveys and other domain-specific means of data collection. During the data exploration stage, data scientists clean up the data so that the data are in the right format for the data analysis stage, and they may also visualize the data for further inspection. Once the data are cleaned, the data analysis stage uses the data to generate insights or build a predictive model. The results of the analysis are then shared with stakeholders through reports and presentations.
Data science is increasingly being adopted in many different fields, including healthcare, economics, education, and the social sciences. The animation below demonstrates the different stages of the data science life cycle.
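The stages described in this section can also be sketched in a few lines of Python. The example below is only a minimal illustration that uses the standard library: the survey responses are made up and hard-coded, and the function names (`acquire`, `explore`, `analyze`, `report`) are hypothetical labels chosen to mirror the stages, not part of any library.

```python
from statistics import mean


def acquire():
    """Data acquisition: return raw survey responses (made-up, hard-coded values)."""
    return ["7", "9", "", "8", "n/a", "10"]


def explore(raw):
    """Data exploration: clean the data by keeping only entries that are whole numbers."""
    return [int(item) for item in raw if item.isdigit()]


def analyze(clean):
    """Data analysis: compute simple summary statistics from the cleaned data."""
    return {"count": len(clean), "mean": mean(clean)}


def report(summary):
    """Sharing results: format the summary as a sentence for stakeholders."""
    return f"{summary['count']} valid responses, average score {summary['mean']:.1f}"


raw_data = acquire()
clean_data = explore(raw_data)
summary = analyze(clean_data)
report_text = report(summary)
print(report_text)  # prints: 4 valid responses, average score 8.5
```

In a real project each stage is far more involved, and the results of one pass often send the data scientist back to an earlier stage, which is why the life cycle is described as iterative.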
Concepts in Practice
What is data science?
Data science tools
Several tools and software are commonly used in data science. Here are some examples.
- Python programming language: Python is widely used in data science. It has a large ecosystem of libraries designed for data analysis, machine learning, and visualization. Some popular Python libraries for data science include NumPy, Pandas, Matplotlib, Seaborn, and scikit-learn. In this chapter, you will explore some of these libraries.
- R programming language: R is commonly used in statistical computing and data analysis, and it offers a wide range of packages and libraries tailored for data manipulation, statistical modeling, and visualization.
- Jupyter Notebook/JupyterLab: Jupyter Notebook and JupyterLab are web-based interactive computing environments that support multiple programming languages, including Python and R. They allow a programmer to create documents that contain code, visualizations, and text, making them suitable for data exploration, analysis, and reporting.
- Google Colaboratory: Google Colaboratory is a cloud-based Jupyter Notebook environment that allows a programmer to write, run, and share Python code online. In this chapter, you will use Google Colaboratory to practice data science concepts.
- Kaggle Kernels: Kaggle Kernels (now called Kaggle Notebooks) is an online data science platform that provides a collaborative environment for building and running code. Kaggle Kernels supports Python and R and offers access to datasets, pre-installed libraries, and computational resources. Kaggle also hosts data science competitions and provides a platform for sharing and discovering data science projects.
- Excel/Sheets: Microsoft Excel and Google Sheets are widely used spreadsheet applications that offer basic data analysis and visualization capabilities. They can help beginners get started with data manipulation, basic statistical calculations, and simple visualizations.
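Before using the Python libraries named in the list above, it can help to check whether they are installed in the current environment. The sketch below is one possible way to do that with the standard library's `importlib.util.find_spec`. The dictionary `LIBRARIES` and the helper `check_libraries` are hypothetical names introduced for this illustration. Note that scikit-learn is imported under the module name `sklearn`.

```python
from importlib.util import find_spec

# Display name -> import name for the libraries listed above.
# scikit-learn is imported as "sklearn".
LIBRARIES = {
    "NumPy": "numpy",
    "Pandas": "pandas",
    "Matplotlib": "matplotlib",
    "Seaborn": "seaborn",
    "scikit-learn": "sklearn",
}


def check_libraries(libraries):
    """Return a dict mapping each display name to True if its module can be found."""
    return {name: find_spec(module) is not None for name, module in libraries.items()}


availability = check_libraries(LIBRARIES)
for name, installed in availability.items():
    status = "installed" if installed else "not installed"
    print(f"{name}: {status}")
```

The output depends on the environment in which the code runs, so no expected output is shown here.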
Concepts in Practice
Data science tools and software
Programming practice with Google Colaboratory
Open the Google Colaboratory document below. To open the document, you need to log in to a Google account; if you do not have one, create one first. Run all cells. You may also try creating new cells or modifying existing cells. To save a copy of your edits, go to "File > Save a Copy in Drive", and the edited file will be stored in your own Google Drive.
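If you want something to try in a new code cell, the short example below is one possibility. It is a hypothetical practice cell, not part of the Colaboratory document referenced above, and the temperature values are made up. Paste it into a new cell and run the cell, for example by pressing Shift+Enter.

```python
import sys

# Made-up temperatures in degrees Celsius.
temperatures_c = [20, 25, 30]

# Convert each temperature to degrees Fahrenheit.
temperatures_f = [c * 9 / 5 + 32 for c in temperatures_c]

# Show which Python version the notebook is running, then the converted values.
print("Python version:", sys.version.split()[0])
print("Fahrenheit:", temperatures_f)  # prints: Fahrenheit: [68.0, 77.0, 86.0]
```

Try changing the values in `temperatures_c` and running the cell again to see how the output changes.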