Udayan Das; Aubrey Lawson; Chris Mayfield; Narges Norouzi

15.1 Introduction to data science

1.

c. Data acquisition is the first stage of the data science life cycle. Data can be collected by the data scientist or gathered previously and provided to the data scientist.

2.

b. The data science life cycle has four stages: 1) data acquisition, 2) data exploration, 3) data analysis, and 4) reporting.

3.

c. The data exploration stage includes data cleaning and visualization.

4.

a. Python is the most popular programming language in data science.

5.

b. NumPy is a Python library used in numerical data analysis.

6.

a. A Google Colaboratory notebook is a document that can be shared with other Google accounts for viewing or editing.

15.2 NumPy

1.

c. To create an m by n matrix of all zeros, np.zeros((m,n)) can be used.

2.

a. ndarray is a data type supported by NumPy. When printing the type of an ndarray object, <class 'numpy.ndarray'> is printed.

3.

d. A NumPy array is optimized for computational and memory efficiency and also supports array-oriented computation.

4.

b. 2 * arr multiplies every element by 2, resulting in:

[[2 4]
 [6 8]]

5.

c. np.array([[1, 2], [1, 2], [1, 2]]) creates a 3 by 2 ndarray. The T attribute returns the transpose of the array: an m by n array becomes an n by m array. Applying T to the given array therefore produces a 2 by 3 array.

6.

a. The result of element-wise multiplication between arr1 and arr2 is:

[[1 0]
 [0 4]]
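The NumPy answers above can be checked with a short sketch. The question data is not reproduced in this answer key, so the values of arr, arr1, and arr2 below are assumptions chosen to be consistent with the printed results.

```python
import numpy as np

# Assumed inputs: the original question data is not shown in this answer key.
zeros = np.zeros((2, 3))                   # 2 by 3 matrix of all zeros
arr = np.array([[1, 2], [3, 4]])           # hypothetical arr
doubled = 2 * arr                          # every element multiplied by 2

tall = np.array([[1, 2], [1, 2], [1, 2]])  # 3 by 2 array
wide = tall.T                              # transpose: 2 by 3 array

arr1 = np.array([[1, 0], [0, 1]])          # hypothetical arr1
arr2 = np.array([[1, 2], [3, 4]])          # hypothetical arr2
product = arr1 * arr2                      # element-wise multiplication

print(type(arr))
print(doubled)
print(tall.shape, wide.shape)
print(product)
```

Note that * multiplies matching elements; matrix multiplication would use the @ operator instead.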

15.3 Pandas

1.

a. Series is a Pandas data structure representing one-dimensional labeled data.

2.

a. A DataFrame object can be considered a collection of one-dimensional labeled objects represented by Series objects.

3.

b. A Pandas DataFrame can store columns of different data types, while a NumPy array holds elements of a single data type and is used mainly for numeric data.

4.

c. The function head() returns a DataFrame's top rows. If an argument is not specified, the default number of returned rows is five.

5.

c. When applied to a column, the unique() function returns the distinct values that appear in that column.

6.

a. When the function describe() is applied to a DataFrame, summary statistics of numerical columns will be generated.
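The Pandas answers above can be illustrated with a small sketch. The DataFrame contents and column names are hypothetical and are not taken from the original questions.

```python
import pandas as pd

# Hypothetical data: columns of different types in one DataFrame.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cai", "Dee", "Eli", "Fay"],
    "team": ["red", "blue", "red", "blue", "red", "green"],
    "score": [90, 85, 70, 85, 60, 95],
})

name_column = df["name"]      # a single column is a Series
first_rows = df.head()        # no argument: the first five rows
teams = df["team"].unique()   # distinct values in the column
summary = df.describe()       # summary statistics of numeric columns

print(type(name_column))
print(first_rows)
print(teams)
print(summary)
```

In this sketch, describe() reports only on the score column because the other two columns are not numeric.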

15.4 Exploratory data analysis

1.

a. The element in the first row and the first column is a.

2.

c. The element in the second row and the second column is

.

3.

a. The element at row label

and column label A is c.

4.

b. df.loc[0:2, "A"] returns rows 0 to 2 (inclusive) of column A.

5.

c. loc[1] returns the row with label 1, which corresponds to the second row of the DataFrame.

6.

b. iloc[:, 0] selects all rows of the column at index 0, which is equivalent to returning the first column.

7.

a. The condition returns all rows where the value in the column with label A is divisible by 2.

8.

b. isnull() returns a Boolean DataFrame representing whether data entries are Null or not.

9.

a. fillna() replaces all Null values with the provided value passed as an argument.

10.

c. The function sum() is applied to each column separately and adds up the values in that column. Because each True value counts as 1, the result is the number of Null values in each column.
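The selection and missing-value answers above can be tried on a small example. The DataFrame below is hypothetical (it is not the one from the original questions) and uses the default integer row labels.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with default row labels 0..3 and two Null entries.
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [10.0, np.nan, 30.0, np.nan]})

first_element = df.iloc[0, 0]      # first row, first column (by position)
by_label = df.loc[0:2, "A"]        # labels 0 to 2, inclusive, of column A
second_row = df.loc[1]             # row with label 1 (the second row)
first_column = df.iloc[:, 0]       # all rows of the column at index 0
even_rows = df[df["A"] % 2 == 0]   # rows where A is divisible by 2

null_mask = df.isnull()            # Boolean DataFrame, True where Null
null_counts = df.isnull().sum()    # number of Null values per column
filled = df.fillna(0)              # replace every Null value with 0

print(by_label)
print(even_rows)
print(null_counts)
print(filled)
```

Unlike ordinary Python slicing, a loc slice includes its end label, which is why by_label has three rows.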

15.5 Data visualization

1.

b. A histogram is used to plot the distribution of continuous variables.

2.

c. A box plot is used to show the distribution of a continuous variable, along with visualizing outliers, minimum, maximum, and quartiles.

3.

a. A line plot is used to show trends and changes over time, or how one variable changes with another.

4.

c. The function call plt.scatter(x, y) creates a scatter plot based on the data stored in the variables x and y.

5.

b. The given code plots a bar plot with four bars representing the given categories. The height of the bars corresponds to the values stored in the values variable.

6.

b. Plotly is a library that creates interactive visualizations.
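The plot types named above can be sketched with Matplotlib. All data values are hypothetical. The sketch draws on subplot axes rather than calling plt.scatter(x, y) directly so that the four plots share one figure, and it selects the Agg backend, which is an assumption made so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt

# Hypothetical data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
categories = ["A", "B", "C", "D"]
values = [3, 7, 5, 2]

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].hist(y, bins=3)                 # distribution of a continuous variable
axes[0, 1].boxplot(y)                      # quartiles, minimum, maximum, outliers
axes[1, 0].scatter(x, y)                   # same plot type as plt.scatter(x, y)
bars = axes[1, 1].bar(categories, values)  # one bar per category
fig.tight_layout()

print(len(bars))
```

In an interactive session or notebook, omitting the matplotlib.use("Agg") line and calling plt.show() would display the figure.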
