Highlights from this chapter include:
- Data science is a multidisciplinary field that combines collection, processing, and analysis of large volumes of data to extract insights and drive informed decision-making.
- The data science life cycle is the framework followed by data scientists to complete a data science project.
- The data science life cycle includes 1) data acquisition, 2) data exploration, 3) data analysis, and 4) reporting.
- Google Colaboratory is a cloud-based Jupyter Notebook environment that allows programmers to write, run, and share Python code online.
- NumPy (Numerical Python) is a Python library that provides support for efficient numerical operations on large, multi-dimensional arrays and serves as a fundamental building block for data analysis in Python.
- NumPy implements an `ndarray` object that allows the creation of multi-dimensional arrays of homogeneous data types and efficient data processing.
- NumPy provides functionalities for mathematical operations, array manipulation, and linear algebra operations.
- Pandas is an open-source Python library used for data cleaning, processing, and analysis.
- Pandas provides Series and DataFrame data structures, data processing functionality, and integration with other libraries.
- Exploratory Data Analysis (EDA) is the task of analyzing data to gain insights, identify patterns, and understand the underlying structure of the data.
- A feature is an individual variable or attribute that is calculated from raw data in a dataset.
- Data indexing can be used to select and access specific rows and columns.
- Data slicing refers to selecting a subset of rows and/or columns from a DataFrame.
- Data filtering involves selecting rows or columns based on certain conditions.
- Missing values in a dataset can occur when data are not available or were not recorded properly.
- Data visualization plays a crucial role in data science by helping to understand the data.
- Different types of visualizations include bar plot, line plot, scatter plot, histogram plot, and box plot.
- Several Python data visualization libraries exist that offer a range of capabilities and features to create different plot types. These libraries include Matplotlib, Seaborn, and Plotly.
- The conventional aliases for importing NumPy, Pandas, and `matplotlib.pyplot` are `np`, `pd`, and `plt`, respectively.
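As a brief illustration of the NumPy points in this list, the following sketch creates a few arrays and applies element-wise, reshaping, and linear algebra operations. The array values and variable names are made up for illustration and do not come from the chapter.

```python
import numpy as np

# Create a 2-D ndarray from nested lists; all elements share one data type.
matrix = np.array([[1, 2], [3, 4]])

# Arrays pre-filled with zeros and ones, given a shape tuple.
zeros = np.zeros((2, 3))
ones = np.ones((2, 2))

# Mathematical operation: arithmetic is applied to every element at once.
doubled = matrix * 2

# Array manipulation: reshape the 2x2 array into a 1-D array of 4 values.
flat = matrix.reshape(4)

# Linear algebra: matrix product of the array with itself.
product = matrix @ matrix

print(matrix.shape)
print(doubled)
print(product)
```

Because every element of an `ndarray` has the same type, operations such as `matrix * 2` run over the whole array without an explicit Python loop.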
At this point, you should be able to write programs to create data structures to store different datasets and explore and visualize datasets.
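The Pandas ideas summarized in this chapter (exploration, indexing, slicing, filtering, and missing values) can be sketched on a small DataFrame. The dataset below is hypothetical, and the column names `name`, `major`, and `score` are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data; one score is deliberately missing (np.nan).
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy", "Dee"],
    "major": ["CS", "Math", "CS", "Bio"],
    "score": [88.0, np.nan, 95.0, 72.0],
})

# Exploration: peek at the first rows and count values in a column.
first_rows = df.head(2)
major_counts = df["major"].value_counts()

# Indexing: loc uses labels, iloc uses integer positions.
by_label = df.loc[0, "name"]
by_position = df.iloc[2, 2]

# Slicing: a label slice with loc includes both endpoints.
subset = df.loc[1:2, ["name", "score"]]

# Filtering: keep only rows that satisfy a condition.
high = df[df["score"] > 80]
cs_or_bio = df[df["major"].isin(["CS", "Bio"])]

# Missing values: detect, fill with the column mean, or drop the rows.
missing_mask = df["score"].isnull()
filled = df["score"].fillna(df["score"].mean())
dropped = df.dropna()

print(high)
print(filled)
```

Filling with the column mean is only one possible strategy; whether to fill or drop missing values depends on the dataset and the analysis.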
| Function | Description |
|---|---|
| | Creates an |
| | Creates an array of zeros. |
| | Creates an array of ones. |
| | Creates an array of random numbers with |
| | Creates an array from a CSV file. |
| | Creates a DataFrame from a list, dictionary, or an array. |
| | Creates a DataFrame from a CSV file. |
| | Returns the first few rows of a DataFrame. |
| | Returns the last few rows of a DataFrame. |
| | Provides a summary of the DataFrame, including the column names, data types, and the number of non-null values. |
| | Generates the column count, mean, standard deviation, minimum, maximum, and quartiles. |
| | Counts the occurrences of unique values in a column and presents them in descending order. |
| | Returns an array of unique values in a column. |
| | Allows for accessing data in a DataFrame using row/column labels. |
| | Allows for accessing data in a DataFrame using row/column integer-based indexes. |
| | Selects only the rows that meet the given |
| | Slices using label ranges. |
| | Slices rows that are in the list |
| | Returns a Boolean array with Boolean values representing whether each entry has been |
| | Replaces |
| | Removes all rows containing a |
| | Takes in two inputs, |
| | Takes in two inputs, |
| | Takes in two inputs, |
| | Takes in one input, |
| | Takes in one input, |
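The five plot types named in this chapter can be sketched with Matplotlib as follows. This assumes that `plt.bar`, `plt.plot`, `plt.scatter`, `plt.hist`, and `plt.boxplot` are the relevant functions; the data values are made up for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical data for the examples.
categories = ["A", "B", "C"]
counts = [5, 3, 8]
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]

fig = plt.figure(figsize=(18, 3))

# Bar plot: two inputs, the categories and the bar heights.
plt.subplot(1, 5, 1)
bars = plt.bar(categories, counts)
plt.title("Bar plot")

# Line plot: two inputs, x and y coordinates joined by line segments.
plt.subplot(1, 5, 2)
lines = plt.plot(x, y)
plt.title("Line plot")

# Scatter plot: two inputs, x and y coordinates drawn as separate points.
plt.subplot(1, 5, 3)
points = plt.scatter(x, y)
plt.title("Scatter plot")

# Histogram: one input, grouped into bins.
plt.subplot(1, 5, 4)
hist_counts, bin_edges, _ = plt.hist(values, bins=5)
plt.title("Histogram")

# Box plot: one input, summarized by median, quartiles, and outliers.
plt.subplot(1, 5, 5)
box = plt.boxplot(values)
plt.title("Box plot")

plt.tight_layout()
```

In a notebook environment such as Google Colaboratory the figure is typically rendered when the cell finishes; in a standalone script, calling `plt.show()` at the end displays it.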