Principles of Data Science

Appendix D: Review of Python Functions

This appendix provides a summary of the Python functions used in this textbook. It is intended as a cross-reference of Python commands, giving a description of each function, the general syntax for its usage, and the section where the function is first used in the text.

Please note that these are very high-level descriptions. Many of the functions require specific libraries to be installed. For more details on Python functions, syntax, and usage, please refer to the documentation for Python and the relevant libraries posted online.

Table D1: Python functions used in this text. Each entry lists the Python function, a description of what it does, its general syntax, and the section where it first appears. Entries are grouped by chapter.
What Are Data and Data Science?

print(): Prints a specified message or specified values to the screen or other output device.
  Syntax: print("text") or print(x, y)
  First reference: Python Basics for Data Science

pd.read_csv(): Loads data from a CSV (comma-separated values) file and stores it in a DataFrame.
  Syntax: pd.read_csv(path_to_csv_datafile)
  First reference: Python Basics for Data Science

DataFrame.describe(): Returns a table with basic statistics for a dataset including min, max, mean, count, and quartiles.
  Syntax: DataFrame.describe(), where DataFrame is the name of the DataFrame
  First reference: Python Basics for Data Science

DataFrame.iloc[]: Allows access to data in a DataFrame using row/column integer-based indexes.
  Syntax: DataFrame.iloc[row, column], where DataFrame is the name of the DataFrame
  First reference: Python Basics for Data Science

DataFrame.loc[]: Used to access a group of rows and columns by labels or a Boolean array.
  Syntax: DataFrame.loc[criteria], where DataFrame is the name of the DataFrame
  First reference: Python Basics for Data Science

plt.scatter(): Generates a scatterplot for (x, y) data.
  Syntax: plt.scatter(x_data, y_data)
  First reference: Python Basics for Data Science

plt.title(): Specifies a title for a chart.
  Syntax: plt.title("Title")
  First reference: Python Basics for Data Science

plt.xlabel(): Specifies a label for the x-axis.
  Syntax: plt.xlabel("x-axis label")
  First reference: Python Basics for Data Science

plt.ylabel(): Specifies a label for the y-axis.
  Syntax: plt.ylabel("y-axis label")
  First reference: Python Basics for Data Science

plt.xlim(): Specifies limits to use for x-axis numbering.
  Syntax: plt.xlim(lower, upper)
  First reference: Python Basics for Data Science

plt.ylim(): Specifies limits to use for y-axis numbering.
  Syntax: plt.ylim(lower, upper)
  First reference: Python Basics for Data Science
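
As a quick illustration (not taken from the text), the short sketch below combines several of these functions. The column names and values are invented, and with a real CSV file the DataFrame would be loaded with pd.read_csv() instead of being built by hand. It assumes pandas and Matplotlib are installed and imported as pd and plt, as in the textbook.

import pandas as pd
import matplotlib.pyplot as plt

# Small made-up dataset standing in for a CSV file; with a real file you would
# call pd.read_csv("path/to/datafile.csv") instead.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score": [52, 60, 68, 71, 80, 88],
})

print(df.describe())                   # min, max, mean, count, and quartiles per column
print(df.iloc[0, 1])                   # value at row 0, column 1 (integer positions)
print(df.loc[df["exam_score"] > 70])   # rows selected with a Boolean condition

plt.scatter(df["hours_studied"], df["exam_score"])
plt.title("Exam Score vs. Hours Studied")
plt.xlabel("Hours studied")
plt.ylabel("Exam score")
plt.xlim(0, 7)
plt.ylim(40, 100)
plt.show()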
Collecting and Preparing Data

pd.read_html(): Reads an HTML table from a web page and converts it into a DataFrame.
  Syntax: pd.read_html(URL)
  First reference: Web Scraping and Social Media Data Collection

pd.to_numeric(): Converts strings or other data types to numeric values.
  Syntax: pd.to_numeric(column_name)
  First reference: Web Scraping and Social Media Data Collection

len(): Returns the length of an object.
  Syntax: len(object)
  First reference: Web Scraping and Social Media Data Collection

re.findall(): Returns all non-overlapping matches of a specified pattern in a string.
  Syntax: re.findall(pattern, string)
  First reference: Web Scraping and Social Media Data Collection

re.search(): Checks if a specified pattern appears in a string.
  Syntax: re.search(pattern, string)
  First reference: Web Scraping and Social Media Data Collection
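
For instance, the sketch below uses a tiny hard-coded HTML table instead of a live web page; the contents are invented, and pd.read_html() additionally needs an HTML parser such as lxml installed.

import re
from io import StringIO
import pandas as pd

# A tiny HTML table used in place of a real web page; with a live page you
# would pass its URL to pd.read_html(URL) instead.
html = """
<table>
  <tr><th>city</th><th>population</th></tr>
  <tr><td>Springfield</td><td>167000</td></tr>
  <tr><td>Shelbyville</td><td>65000</td></tr>
</table>
"""

tables = pd.read_html(StringIO(html))   # returns a list of DataFrames, one per table found
print(len(tables))                      # how many tables were found
df = tables[0]

# pd.to_numeric() converts a column of numeric strings to numeric values
codes = pd.Series(["00123", "00456"])
print(pd.to_numeric(codes))

text = "Springfield: 167000; Shelbyville: 65000"
print(re.findall(r"\d+", text))                     # every run of digits in the string
print(re.search("Shelbyville", text) is not None)   # True if the pattern appears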
Descriptive Statistics: Statistical Measurements and Probability Distributions

binom.pmf(): Calculates the probability mass function (PMF) for a binomial distribution. It gives the probability of having exactly x successes in n trials with success probability p.
  Syntax: binom.pmf(x, n, p), where x is the number of successes in the experiment, n is the number of trials in the experiment, and p is the probability of success
  First reference: Discrete and Continuous Probability Distributions

round(): Rounds a numeric result to a specified level of precision.
  Syntax: round(number, digits)
  First reference: Discrete and Continuous Probability Distributions

poisson.pmf(): Calculates probabilities associated with the Poisson distribution.
  Syntax: poisson.pmf(x, mu), where x is the number of events of interest and mu is the mean of the Poisson distribution
  First reference: Discrete and Continuous Probability Distributions

norm.cdf(): Calculates probabilities associated with the normal distribution (returns the area under the normal probability density function to the left of a specified measurement).
  Syntax: norm.cdf(x, mu, std), where x is the measurement of interest, mu is the mean of the normal distribution, and std is the standard deviation of the normal distribution
  First reference: Discrete and Continuous Probability Distributions
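
As a brief example (the numbers are chosen arbitrarily; the distributions come from scipy.stats):

from scipy.stats import binom, poisson, norm

# Binomial: probability of exactly 3 successes in 10 trials with success probability 0.5
p_binom = binom.pmf(3, 10, 0.5)

# Poisson: probability of exactly 2 events when the mean number of events is 4.5
p_poisson = poisson.pmf(2, 4.5)

# Normal: area to the left of x = 85 for a distribution with mean 80 and standard deviation 5
p_norm = norm.cdf(85, 80, 5)

# round() trims each result to four decimal places
print(round(p_binom, 4), round(p_poisson, 4), round(p_norm, 4))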
Inferential Statistics and Regression Analysis

t.ppf(): Generates the value of the t-distribution corresponding to a specified area under the t-distribution curve and specified degrees of freedom.
  Syntax: t.ppf(area_to_left, degrees_of_freedom)
  First reference: Statistical Inference and Confidence Intervals

bootstrap(): Performs the bootstrap process to generate a confidence interval.
  Syntax: bootstrap(data, statistic, confidence_level, number_resamples)
  First reference: Statistical Inference and Confidence Intervals

norm.interval(): Calculates a confidence interval for the mean when the population standard deviation is known, given the sample mean, population standard deviation, and sample size (uses the normal distribution). Note: standard error is the standard deviation divided by the square root of the sample size.
  Syntax: norm.interval(conf_level, sample_mean, standard_error)
  First reference: Statistical Inference and Confidence Intervals

t.interval(): Calculates a confidence interval for the mean when the population standard deviation is unknown, given the sample mean, sample standard deviation, and sample size (uses the t-distribution). Note: standard error is the standard deviation divided by the square root of the sample size.
  Syntax: t.interval(conf_level, degrees_freedom, sample_mean, standard_error)
  First reference: Statistical Inference and Confidence Intervals

proportion_confint(): Calculates a confidence interval for a proportion (uses the normal distribution).
  Syntax: proportion_confint(success, sample_size, alpha)
  First reference: Statistical Inference and Confidence Intervals

ttest_1samp(): Returns the value of the test statistic and the two-tailed p-value for a one-sample hypothesis test using the t-distribution.
  Syntax: ttest_1samp(data_array, null_hypothesis_mean)
  First reference: Hypothesis Testing

ttest_ind_from_stats(): Returns the value of the test statistic and the two-tailed p-value for a two-sample hypothesis test using the t-distribution.
  Syntax: ttest_ind_from_stats(sample_mean1, sample_standard_deviation1, sample_size1, sample_mean2, sample_standard_deviation2, sample_size2)
  First reference: Hypothesis Testing

np.array(): Creates a numerical array from a list-like object.
  Syntax: np.array(object)
  First reference: Correlation and Linear Regression Analysis

pearsonr(): Calculates the value of the Pearson correlation coefficient r.
  Syntax: pearsonr(x_data, y_data)
  First reference: Correlation and Linear Regression Analysis

linregress(): Generates a linear regression model and provides the slope, y-intercept, and other regression-related output.
  Syntax: linregress(x_data, y_data)
  First reference: Correlation and Linear Regression Analysis

f_oneway(): Returns both the F test statistic and the p-value for the one-way ANOVA hypothesis test.
  Syntax: f_oneway(Array1, Array2, Array3, …)
  First reference: Analysis of Variance (ANOVA)
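
A compact sketch with made-up sample values, assuming SciPy (scipy.stats) and, for proportion_confint(), statsmodels are installed:

import numpy as np
from scipy import stats
from statsmodels.stats.proportion import proportion_confint

sample = np.array([12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7])

# 95% confidence interval for the mean using the t-distribution;
# standard error = sample standard deviation / sqrt(sample size)
std_err = sample.std(ddof=1) / np.sqrt(len(sample))
print(stats.t.interval(0.95, len(sample) - 1, sample.mean(), std_err))

# One-sample t-test of the null hypothesis that the population mean is 12.0
t_stat, p_value = stats.ttest_1samp(sample, 12.0)
print(t_stat, p_value)

# Confidence interval for a proportion: 40 successes out of 100, alpha = 0.05
print(proportion_confint(40, 100, 0.05))

# Correlation and simple linear regression on (x, y) data
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])
r, p = stats.pearsonr(x, y)
model = stats.linregress(x, y)
print(r, model.slope, model.intercept)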
Time Series and Forecasting

plot(): Generates a time series plot.
  Syntax: plot(dataframe)
  First reference: Introduction to Time Series Analysis

rolling(): Provides rolling window calculations.
  Syntax: rolling(window=window)
  First reference: Time Series Forecasting Methods

mean(): Computes the average of a dataset.
  Syntax: mean(dataset)
  First reference: Time Series Forecasting Methods

diff(): Computes the first-order difference of data in a window.
  Syntax: diff(dataframe)
  First reference: Time Series Forecasting Methods

plot_acf(): Plots the ACF (autocorrelation function) for a time series, up to lag L.
  Syntax: plot_acf(time_series_data, lags=L)
  First reference: Time Series Forecasting Methods

STL(): Decomposes a time series with known period P into its components.
  Syntax: STL(time_series_data, period=P)
  First reference: Time Series Forecasting Methods

ewm(): Performs exponential moving average (EMA) smoothing.
  Syntax: ewm(dataframe)
  First reference: Time Series Forecasting Methods

adfuller(): Performs the augmented Dickey-Fuller (ADF) test, a statistical test for checking the stationarity of a time series.
  Syntax: adfuller(time_series_data)
  First reference: Time Series Forecasting Methods

ARIMA(): Fits an ARIMA(p, d, q) (autoregressive integrated moving average) model to time series data.
  Syntax: ARIMA(time_series_data, order=(p, d, q))
  First reference: Time Series Forecasting Methods
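
A sketch on a synthetic monthly series (the data are generated purely for illustration; the time series tools here come from pandas and statsmodels):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.seasonal import STL
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA

# Synthetic monthly series: upward trend + yearly seasonality + noise
rng = np.random.default_rng(0)
months = pd.date_range("2018-01-01", periods=60, freq="MS")
values = (np.linspace(50, 80, 60)
          + 5 * np.sin(2 * np.pi * np.arange(60) / 12)
          + rng.normal(0, 1, 60))
series = pd.Series(values, index=months)

series.plot(title="Synthetic monthly series")   # time series plot

smoothed = series.rolling(window=12).mean()     # 12-month moving average
ema = series.ewm(span=12).mean()                # exponential moving average smoothing
differenced = series.diff()                     # first-order differences

plot_acf(series, lags=24)                       # autocorrelation up to lag 24
decomposition = STL(series, period=12).fit()    # trend / seasonal / residual components

adf_stat, p_value, *rest = adfuller(series)     # augmented Dickey-Fuller stationarity test
print(adf_stat, p_value)

model = ARIMA(series, order=(1, 1, 1)).fit()    # fit an ARIMA(1, 1, 1) model
print(model.forecast(steps=6))                  # forecast the next six months
plt.show()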
Decision-Making Using Machine Learning Basics

LogisticRegression(): Creates a logistic regression model.
  Syntax: LogisticRegression()
  First reference: Classification Using Machine Learning

model.fit(): Trains a machine learning model on a given dataset.
  Syntax: model.fit(feature_matrix, target_vector)
  First reference: Classification Using Machine Learning

KMeans(): Sets up a k-means clustering model. (Use model.fit() to fit the model to a dataset.)
  Syntax: KMeans(n_clusters=k)
  First reference: Classification Using Machine Learning

DBSCAN(): Sets up a DBSCAN (Density-Based Spatial Clustering of Applications with Noise) model. (Use model.fit() to fit the model to a dataset.)
  Syntax: DBSCAN(options)
  First reference: Classification Using Machine Learning

confusion_matrix(): Used to visualize the performance of a model by comparing actual and predicted values.
  Syntax: confusion_matrix(target_values, predicted_values)
  First reference: Classification Using Machine Learning

LinearRegression(): Fits a linear regression model to data.
  Syntax: LinearRegression().fit(feature_matrix, target_vector)
  First reference: Machine Learning in Regression Analysis

predict(): Used on trained machine learning models to generate predictions for new data points.
  Syntax: predict(feature_matrix)
  First reference: Machine Learning in Regression Analysis

DecisionTreeClassifier(): Sets up a decision tree model. (Use model.fit() to fit the model to a dataset.)
  Syntax: DecisionTreeClassifier(options)
  First reference: Decision Trees

ens.RandomForestRegressor(): Sets up a random forest model. (Use model.fit() to fit the model to a dataset.)
  Syntax: ens.RandomForestRegressor(options)
  First reference: Other Machine Learning Techniques

GaussianNB(): Sets up a Naïve Bayes classification model. (Use model.fit() to fit the model to a dataset.)
  Syntax: GaussianNB()
  First reference: Other Machine Learning Techniques
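
To illustrate, here is a sketch with a tiny made-up dataset; the estimators come from scikit-learn:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix

# Tiny two-feature dataset invented for illustration
X = np.array([[1, 2], [2, 1], [2, 3], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

# Supervised: fit a logistic regression classifier and check its predictions
clf = LogisticRegression()
clf.fit(X, y)                          # model.fit(feature_matrix, target_vector)
y_pred = clf.predict(X)                # predictions (here, on the training data)
print(confusion_matrix(y, y_pred))     # rows = actual classes, columns = predicted classes

# Unsupervised: k-means clustering with k = 2 clusters
km = KMeans(n_clusters=2, n_init=10)
km.fit(X)
print(km.labels_)                      # cluster assignment for each row of X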
Deep Learning and Artificial Intelligence (AI) Basics

Perceptron(): Sets up a perceptron model. (Use model.fit() to fit the model to a dataset.)
  Syntax: Perceptron()
  First reference: Introduction to Neural Networks

train_test_split(): Splits a dataset randomly into train and test subsets, using a proportion P of the data for the test set.
  Syntax: train_test_split(input_data_arrays, target_data, test_size=P)
  First reference: Introduction to Neural Networks

StandardScaler(): Used to standardize features by removing the mean and scaling to unit variance.
  Syntax: StandardScaler()
  First reference: Introduction to Neural Networks

accuracy_score(): Calculates the accuracy of a classification model as the ratio of the number of correct predictions to the total number of predictions.
  Syntax: accuracy_score(y_true, y_predicted)
  First reference: Introduction to Neural Networks

scaler.fit_transform(): Fits a scaler to the data and then transforms the data according to the fitted scaler.
  Syntax: scaler.fit_transform(array)
  First reference: Introduction to Neural Networks

scaler.transform(): Applies a previously fitted scaler to new data.
  Syntax: scaler.transform(array)
  First reference: Introduction to Neural Networks

tf.keras.Sequential(): Creates a linear stack of layers for building a neural network model.
  Syntax: tf.keras.Sequential(layers, additional options)
  First reference: Backpropagation

model.compile(): Used to configure the learning process of a neural network model before training.
  Syntax: model.compile(optimizer, loss, metrics)
  First reference: Backpropagation
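
A sketch with made-up data showing the usual split, scale, train, and score workflow, plus a small Keras model; the last part assumes TensorFlow is installed:

import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Made-up two-feature dataset for illustration
X = np.array([[1, 2], [2, 1], [2, 3], [6, 5], [7, 7], [8, 6], [1, 1], [7, 5]])
y = np.array([0, 0, 0, 1, 1, 1, 0, 1])

# Hold out 25% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Standardize features: fit the scaler on the training data, then apply it to the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train a perceptron and measure accuracy on the held-out data
model = Perceptron()
model.fit(X_train_scaled, y_train)
print(accuracy_score(y_test, model.predict(X_test_scaled)))

# A neural network built as a linear stack of layers, then configured for training
import tensorflow as tf
net = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
net.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])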
Visualizing Data

boxplot(): Creates a box-and-whisker plot.
  Syntax: plt.boxplot(array)
  First reference: Encoding Univariate Data

hist(): Creates a histogram.
  Syntax: plt.hist(array)
  First reference: Encoding Univariate Data

plot(): Creates 2D line plots such as a time series graph.
  Syntax: plt.plot(x_data, y_data)
  First reference: Graphing Probability Distributions

bar(): Creates a bar chart.
  Syntax: plt.bar(x_array, heights)
  First reference: Graphing Probability Distributions

imshow(): Displays an image on a 2D regular raster, such as a heatmap.
  Syntax: plt.imshow(array)
  First reference: Geospatial and Heatmap Data Visualization Using Python

heatmap(): Creates a heatmap visualization.
  Syntax: sns.heatmap(array)
  First reference: Geospatial and Heatmap Data Visualization Using Python

colorbar(): Adds a color bar (colormap scale) to a figure.
  Syntax: plt.colorbar()
  First reference: Multivariate and Network Data Visualization Using Python

corr(): Calculates the pairwise correlations of columns in a DataFrame.
  Syntax: dataframe.corr()
  First reference: Multivariate and Network Data Visualization Using Python

add_subplot(): Adds a subplot to a figure stored in fig.
  Syntax: fig.add_subplot(position)
  First reference: Multivariate and Network Data Visualization Using Python

ax.scatter(): Creates a scatterplot.
  Syntax: ax.scatter(x_data, y_data)
  First reference: Multivariate and Network Data Visualization Using Python
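
For instance (random data generated only for illustration; assumes Matplotlib, seaborn, and pandas are installed):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(1)
df = pd.DataFrame({"x": rng.normal(0, 1, 100), "y": rng.normal(5, 2, 100)})

plt.boxplot(df["x"])                    # box-and-whisker plot
plt.figure()
plt.hist(df["y"], bins=10)              # histogram
plt.figure()
plt.bar(["A", "B", "C"], [3, 7, 5])     # bar chart

# Pairwise correlations shown two ways: a seaborn heatmap, and imshow with a color bar
corr = df.corr()
plt.figure()
sns.heatmap(corr, annot=True)
plt.figure()
plt.imshow(corr)
plt.colorbar()

# Scatterplot drawn on an Axes object added to a Figure
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(df["x"], df["y"])
plt.show()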
Reporting Results

plot_tree(): Creates a visualization of a decision tree.
  Syntax: plot_tree(estimator, feature_names)
  First reference: Validating Your Model

DataFrame.info(): Provides a concise summary of a DataFrame's structure and content.
  Syntax: DataFrame.info()
  First reference: Validating Your Model

DataFrame.drop(): Removes rows or columns from a DataFrame.
  Syntax: DataFrame.drop(labels, axis=rows_columns)
  First reference: Validating Your Model

score(): Evaluates the performance of a trained model on a given dataset.
  Syntax: model.score(feature_matrix, true_labels)
  First reference: Validating Your Model

dt.get_depth(): Retrieves the depth of the decision tree dt.
  Syntax: dt.get_depth()
  First reference: Validating Your Model

cross_val_score(): Evaluates a model's performance using cross-validation.
  Syntax: cross_val_score(estimator, feature_matrix, target_variable)
  First reference: Validating Your Model

GridSearchCV(): Searches for the best parameters for a specified estimator, using k-fold cross-validation.
  Syntax: GridSearchCV(estimator, parameters, k)
  First reference: Validating Your Model
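
A closing sketch using scikit-learn's built-in iris data; the tree depth and parameter grid are arbitrary choices made for illustration:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import cross_val_score, GridSearchCV

# Load a small built-in dataset as a DataFrame
iris = load_iris(as_frame=True)
df = iris.frame
df.info()                                          # summary of structure and content

X = df.drop("target", axis=1)                      # drop the label column to keep the features
y = df["target"]

dt = DecisionTreeClassifier(max_depth=3, random_state=0)
dt.fit(X, y)
print(dt.score(X, y))                              # accuracy on the given data
print(dt.get_depth())                              # depth of the fitted tree

print(cross_val_score(dt, X, y, cv=5))             # 5-fold cross-validation scores

# Grid search over candidate depths with k-fold cross-validation
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    {"max_depth": [2, 3, 4, 5]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)

plot_tree(dt, feature_names=list(X.columns))       # visualize the fitted tree
plt.show()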