This appendix provides a summary of Python functions used in this textbook. The intent is to provide students with a cross-reference of Python commands that includes a description of the Python functions, general syntax for usage, and a link to the section where the function is first used in the text.
Please note that this is a very high-level description of these functions. Many of them belong to third-party libraries (for example, the functions prefixed with pd or plt) that must be installed and imported before use. For more details on Python functions, syntax, and usage, please refer to the Python documentation posted online.
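
Because many of the functions listed below depend on libraries that must be installed, it can help to check which ones are available before running examples. The following is a minimal sketch using only the standard library. The list of import names is an assumption inferred from the prefixes and function names in the table (for example, pd for pandas and plt for Matplotlib), and the helper name missing_libraries is hypothetical.

```python
from importlib.util import find_spec

# Assumed import names for the libraries behind the table entries
# (inferred from common usage; adjust to match your own setup).
LIBRARIES = [
    "pandas",
    "numpy",
    "matplotlib",
    "scipy",
    "statsmodels",
    "sklearn",
    "seaborn",
    "tensorflow",
]


def missing_libraries(names):
    """Return the names in `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_libraries(LIBRARIES)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All listed libraries are available.")
```

Any library reported as not installed would need to be added with your package manager before the corresponding functions can be used.
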
| Python Function | Description | Syntax | First Reference |
|---|---|---|---|
| What Are Data and Data Science? | | | |
| print() | Prints a specified message or specified values to the screen or other output device | print("text") or print(x, y) | Python Basics for Data Science |
| pd.read_csv() | Loads data from a CSV (comma-separated values) file and stores it in a DataFrame | pd.read_csv | Python Basics for Data Science |
| DataFrame.describe() | Returns a table with basic statistics for a dataset including min, max, mean, count, and quartiles | DataFrame.describe() Where: DataFrame is the name of the DataFrame. | Python Basics for Data Science |
| DataFrame.iloc[] | Allows access to data in a DataFrame using row/column integer-based indexes | DataFrame.iloc[row, column] Where: DataFrame is the name of the DataFrame. | Python Basics for Data Science |
| DataFrame.loc[] | Used to access a group of rows and columns by labels or a Boolean array | DataFrame.loc[criteria] Where: DataFrame is the name of the DataFrame. | Python Basics for Data Science |
| plt.scatter() | Generates a scatterplot for (x, y) data | plt.scatter(x_data, y_data) | Python Basics for Data Science |
| plt.title() | Specifies a title for a chart | plt.title("Title") | Python Basics for Data Science |
| plt.xlabel() | Specifies a label for the x-axis | plt.xlabel("x-axis label") | Python Basics for Data Science |
| plt.ylabel() | Specifies a label for the y-axis | plt.ylabel("y-axis label") | Python Basics for Data Science |
| plt.xlim() | Specifies limits to use for x-axis numbering | plt.xlim(lower, upper) | Python Basics for Data Science |
| plt.ylim() | Specifies limits to use for y-axis numbering | plt.ylim(lower, upper) | Python Basics for Data Science |
| Collecting and Preparing Data | | | |
| pd.read_html() | Reads HTML tables from a web page and converts them into a list of DataFrames | pd.read_html(URL) | Web Scraping and Social Media Data Collection |
| pd.to_numeric() | Converts strings or other data types to numeric values | pd.to_numeric | Web Scraping and Social Media Data Collection |
| len() | Returns the length of an object | len(object) | Web Scraping and Social Media Data Collection |
| re.findall() | Returns all non-overlapping matches of a specified pattern in a string | re.findall(pattern, string) | Web Scraping and Social Media Data Collection |
| re.search() | Checks whether a specified pattern appears in a string and returns the first match | re.search(pattern, string) | Web Scraping and Social Media Data Collection |
| Descriptive Statistics: Statistical Measurements and Probability Distributions | | | |
| binom.pmf() | Calculates the probability mass function (PMF) for a binomial distribution. It gives the probability of having exactly x successes in n trials with success probability p. | binom.pmf(x, n, p) Where: x is the number of successes in the experiment, n is the number of trials in the experiment, p is the probability of success. | Discrete and Continuous Probability Distributions |
| round() | Rounds a numeric result to a specified level of precision | round(number, digits) | Discrete and Continuous Probability Distributions |
| poisson.pmf() | Calculates probabilities associated with the Poisson distribution | poisson.pmf(x, mu) Where: x is the number of events of interest, mu is the mean of the Poisson distribution. | Discrete and Continuous Probability Distributions |
| norm.cdf() | Calculates probabilities associated with the normal distribution (returns the area under the normal probability density function to the left of a specified measurement) | norm.cdf(x, mu, std) Where: x is the measurement of interest, mu is the mean of the normal distribution, std is the standard deviation of the normal distribution. | Discrete and Continuous Probability Distributions |
| Inferential Statistics and Regression Analysis | | | |
| t.ppf() | Generates the value of the t-distribution corresponding to a specified area under the t-distribution curve and specified degrees of freedom | t.ppf | Statistical Inference and Confidence Intervals |
| bootstrap() | Performs the bootstrap process to generate a confidence interval | bootstrap | Statistical Inference and Confidence Intervals |
| norm.interval() | Calculates a confidence interval for the mean when the population standard deviation is known, given sample mean, population standard deviation, and sample size (uses normal distribution). Note: Standard error is the standard deviation divided by the square root of the sample size. | norm.interval | Statistical Inference and Confidence Intervals |
| t.interval() | Calculates a confidence interval for the mean when the population standard deviation is unknown, given sample mean, sample standard deviation, and sample size (uses t-distribution). Note: Standard error is the standard deviation divided by the square root of the sample size. | t.interval | Statistical Inference and Confidence Intervals |
| proportion_confint() | Calculates a confidence interval for a proportion (uses normal distribution) | proportion_confint | Statistical Inference and Confidence Intervals |
| ttest_1samp() | Returns the value of the test statistic and the two-tailed p-value for a one-sample hypothesis test using the t-distribution | ttest_1samp | Hypothesis Testing |
| ttest_ind_from_stats() | Returns the value of the test statistic and the two-tailed p-value for a two-sample hypothesis test using the t-distribution | ttest_ind_from_stats | Hypothesis Testing |
| np.array() | Creates a numerical array from a list-like object | np.array(object) | Correlation and Linear Regression Analysis |
| pearsonr() | Calculates the value of the Pearson correlation coefficient r | pearsonr | Correlation and Linear Regression Analysis |
| linregress() | Generates a linear regression model and provides slope, y-intercept, and other regression-related output | linregress | Correlation and Linear Regression Analysis |
| f_oneway() | Returns both the F test statistic and the p-value for the one-way ANOVA hypothesis test | f_oneway | Analysis of Variance (ANOVA) |
| Time Series and Forecasting | | | |
| plot() | Generates a time series plot | plot(dataframe) | Introduction to Time Series Analysis |
| rolling() | Provides rolling window calculations | rolling | Time Series Forecasting Methods |
| mean() | Computes the average of a dataset | mean(dataset) | Time Series Forecasting Methods |
| diff() | Computes the first-order difference of data in a window | diff(dataframe) | Time Series Forecasting Methods |
| plot_acf() | Plots the ACF (autocorrelation function) for a time series, up to lag L | plot_acf | Time Series Forecasting Methods |
| STL() | Decomposes a time series with known period P into its components | STL | Time Series Forecasting Methods |
| ewm() | Performs exponential moving average (EMA) smoothing | ewm(dataframe) | Time Series Forecasting Methods |
| adfuller() | Performs the Augmented Dickey-Fuller (ADF) test, which is a statistical test for checking the stationarity of a time series | adfuller | Time Series Forecasting Methods |
| ARIMA() | Fits an ARIMA(p, d, q) (AutoRegressive Integrated Moving Average) model to time series data | ARIMA | Time Series Forecasting Methods |
| Decision-Making Using Machine Learning Basics | | | |
| LogisticRegression() | Creates a logistic regression model | LogisticRegression() | Classification Using Machine Learning |
| model.fit() | Trains a machine learning model on a given dataset | model.fit | Classification Using Machine Learning |
| KMeans() | Sets up a k-means clustering model (Use model.fit() to fit the model to a dataset.) | KMeans(n_clusters=k) | Classification Using Machine Learning |
| DBSCAN() | Sets up a DBSCAN (Density-Based Spatial Clustering of Applications with Noise) model (Use model.fit() to fit the model to a dataset.) | DBSCAN(options) | Classification Using Machine Learning |
| confusion_matrix() | Computes a confusion matrix, which summarizes the performance of a classification model by comparing actual and predicted values | confusion_matrix | Classification Using Machine Learning |
| LinearRegression() | Fits a linear regression model to data | LinearRegression().fit(feature_matrix, ...) | Machine Learning in Regression Analysis |
| predict() | Used on trained machine learning models to generate predictions for new data points | predict(feature_matrix) | Machine Learning in Regression Analysis |
| DecisionTreeClassifier() | Sets up a decision tree model (Use model.fit() to fit the model to a dataset.) | DecisionTreeClassifier | Decision Trees |
| ens.RandomForestRegressor() | Sets up a random forest model (Use model.fit() to fit the model to a dataset.) | ens.RandomForestRegressor | Other Machine Learning Techniques |
| GaussianNB() | Sets up a Naïve Bayes classification model (Use model.fit() to fit the model to a dataset.) | GaussianNB() | Other Machine Learning Techniques |
| Deep Learning and Artificial Intelligence (AI) Basics | | | |
| Perceptron() | Sets up a perceptron model (Use model.fit() to fit the model to a dataset.) | Perceptron() | Introduction to Neural Networks |
| train_test_split() | Splits a dataset randomly into train and test subsets, using a proportion P of the data for the test set | train_test_split | Introduction to Neural Networks |
| StandardScaler() | Used to standardize features by removing the mean and scaling to unit variance | StandardScaler() | Introduction to Neural Networks |
| accuracy_score() | Calculates the accuracy of a classification model as the ratio of the number of correct predictions to the total number of predictions | accuracy_score | Introduction to Neural Networks |
| scaler.fit_transform() | Fits a scaler to the data and then transforms the data according to the fitted scaler | scaler.fit_transform(array) | Introduction to Neural Networks |
| scaler.transform() | Applies a previously fitted scaler to new data | scaler.transform(array) | Introduction to Neural Networks |
| tf.keras.Sequential() | Creates a linear stack of layers for building a neural network model | tf.keras.Sequential | Backpropagation |
| model.compile() | Used to configure the learning process of a neural network model before training | model.compile | Backpropagation |
| Visualizing Data | | | |
| boxplot() | Creates a box-and-whisker plot | plt.boxplot(array) | Encoding Univariate Data |
| hist() | Creates a histogram | plt.hist(array) | Encoding Univariate Data |
| plot() | Creates 2D line plots such as a time series graph | plt.plot | Graphing Probability Distributions |
| bar() | Creates a bar chart | plt.bar | Graphing Probability Distributions |
| imshow() | Displays an image on a 2D regular raster, such as a heatmap | plt.imshow(array) | Geospatial and Heatmap Data Visualization Using Python |
| heatmap() | Creates a heatmap visualization | sns.heatmap(array) | Geospatial and Heatmap Data Visualization Using Python |
| colorbar() | Adds a colorbar (a legend for the color scale) to a figure | plt.colorbar() | Multivariate and Network Data Visualization Using Python |
| corr() | Calculates the pairwise correlations of columns in a DataFrame | dataframe.corr() | Multivariate and Network Data Visualization Using Python |
| add_subplot() | Adds a subplot to a figure stored in fig | fig.add_subplot | Multivariate and Network Data Visualization Using Python |
| ax.scatter() | Creates a scatterplot | ax.scatter | Multivariate and Network Data Visualization Using Python |
| Reporting Results | | | |
| plot_tree() | Creates a visualization of a decision tree | plot_tree | Validating Your Model |
| DataFrame.info() | Provides a concise summary of a DataFrame's structure and content | DataFrame.info() | Validating Your Model |
| DataFrame.drop() | Removes rows or columns from a DataFrame | DataFrame.drop | Validating Your Model |
| score() | Evaluates the performance of a trained model on a given dataset | model.score | Validating Your Model |
| dt.get_depth() | Retrieves the depth of the decision tree, dt | dt.get_depth() | Validating Your Model |
| cross_val_score() | Evaluates a model's performance using cross-validation | cross_val_score | Validating Your Model |
| GridSearchCV() | Searches for the best parameters for a specified estimator, with k-fold cross-validation | GridSearchCV | Validating Your Model |
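
A few entries in the table (print(), len(), round(), re.findall(), and re.search()) are built in or part of the standard library, so they can be tried without installing anything. The short sketch below illustrates the syntax given in the table. The sample sentence and variable names are made up for illustration and do not come from the textbook.

```python
import re

# Hypothetical sample text used only for this illustration.
text = "Orders: 12 apples, 7 pears, and 30 plums."

# re.findall(pattern, string) returns every non-overlapping match as a list.
numbers = re.findall(r"\d+", text)

# len(object) returns the number of items, here the number of matches.
count = len(numbers)

# re.search(pattern, string) returns a match object for the first match,
# or None when the pattern does not appear in the string.
first = re.search(r"\d+", text)
has_melons = re.search(r"melons", text) is not None

# round(number, digits) rounds the average quantity to two decimal places.
average = round(sum(int(n) for n in numbers) / count, 2)

# print(x, y) writes several values to the screen, separated by spaces.
print(numbers, count, first.group(), has_melons, average)
```

Note that re.findall() returns the matches as strings, so they are converted with int() before averaging.
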