This appendix provides a summary of Python functions used in this textbook. The intent is to provide students with a cross-reference of Python commands that includes a description of the Python functions, general syntax for usage, and a link to the section where the function is first used in the text.
Please note this is a very high-level description of these functions. Many functions require specific libraries to be installed. For more details on Python functions, syntax, and usage, please refer to the Python documentation posted online.
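Because many of these functions depend on third-party libraries, it can help to check which ones are available before running an example. The following is a minimal sketch using only the standard library. The list of import names and the helper `find_missing` are assumptions made for illustration and are not taken from the text.

```python
import importlib.util

# Import names of third-party libraries that functions in this appendix
# commonly come from (an assumed list; adjust it to match your own setup).
LIBRARIES = [
    "pandas", "numpy", "matplotlib", "scipy",
    "statsmodels", "sklearn", "seaborn", "tensorflow",
]

def find_missing(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = find_missing(LIBRARIES)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All listed libraries are installed.")
```

Any name reported as not installed would need to be installed with your package manager before the corresponding functions can be used.
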
Python Function | Description | Syntax | First Reference |
---|---|---|---|
What Are Data and Data Science? | |||
print() | Prints a specified message or specified values to the screen or other output device | print("text") or print(x, y) | Python Basics for Data Science |
pd.read_csv() | Loads data from a CSV (comma-separated values) file and stores it in a DataFrame | pd.read_csv | Python Basics for Data Science |
DataFrame.describe() | Returns a table with basic statistics for a dataset, including min, max, mean, count, and quartiles | DataFrame.describe(), where DataFrame is the name of the DataFrame | Python Basics for Data Science |
DataFrame.iloc[] | Allows access to data in a DataFrame using integer-based row and column indexes | DataFrame.iloc[row, column], where DataFrame is the name of the DataFrame | Python Basics for Data Science |
DataFrame.loc[] | Accesses a group of rows and columns by labels or a Boolean array | DataFrame.loc[criteria], where DataFrame is the name of the DataFrame | Python Basics for Data Science |
plt.scatter() | Generates a scatterplot for (x, y) data | plt.scatter(x_data, y_data) | Python Basics for Data Science |
plt.title() | Specifies a title for a chart | plt.title("Title") | Python Basics for Data Science |
plt.xlabel() | Specifies a label for the x-axis | plt.xlabel("x-axis label") | Python Basics for Data Science |
plt.ylabel() | Specifies a label for the y-axis | plt.ylabel("y-axis label") | Python Basics for Data Science |
plt.xlim() | Specifies limits to use for x-axis numbering | plt.xlim(lower, upper) | Python Basics for Data Science |
plt.ylim() | Specifies limits to use for y-axis numbering | plt.ylim(lower, upper) | Python Basics for Data Science |
Collecting and Preparing Data | |||
pd.read_html() | Reads HTML tables from a web page and returns them as a list of DataFrames | pd.read_html(URL) | Web Scraping and Social Media Data Collection |
pd.to_numeric() | Converts strings or other data types to numeric values | pd.to_numeric | Web Scraping and Social Media Data Collection |
len() | Returns the length (number of items) of an object | len(object) | Web Scraping and Social Media Data Collection |
re.findall() | Returns all non-overlapping matches of a specified pattern in a string | re.findall(pattern, string) | Web Scraping and Social Media Data Collection |
re.search() | Checks whether a specified pattern appears in a string (returns a match object for the first match, or None) | re.search(pattern, string) | Web Scraping and Social Media Data Collection |
Descriptive Statistics: Statistical Measurements and Probability Distributions | |||
binom.pmf() | Calculates the probability mass function (PMF) for a binomial distribution. It gives the probability of having exactly x successes in n trials with success probability p. | binom.pmf(x, n, p), where x is the number of successes in the experiment, n is the number of trials in the experiment, and p is the probability of success | Discrete and Continuous Probability Distributions |
round() | Rounds a numeric result to a specified level of precision | round(number, digits) | Discrete and Continuous Probability Distributions |
poisson.pmf() | Calculates probabilities associated with the Poisson distribution | poisson.pmf(x, mu), where x is the number of events of interest and mu is the mean of the Poisson distribution | Discrete and Continuous Probability Distributions |
norm.cdf() | Calculates probabilities associated with the normal distribution (returns the area under the normal probability density function to the left of a specified measurement) | norm.cdf(x, mu, std), where x is the measurement of interest, mu is the mean of the normal distribution, and std is the standard deviation of the normal distribution | Discrete and Continuous Probability Distributions |
Inferential Statistics and Regression Analysis | |||
t.ppf() | Generates the value of the t-distribution corresponding to a specified area under the t-distribution curve and specified degrees of freedom | t.ppf | Statistical Inference and Confidence Intervals |
bootstrap() | Performs the bootstrap process to generate a confidence interval | bootstrap | Statistical Inference and Confidence Intervals |
norm.interval() | Calculates a confidence interval for the mean when the population standard deviation is known, given the sample mean, population standard deviation, and sample size (uses the normal distribution). Note: Standard error is the standard deviation divided by the square root of the sample size. | norm.interval | Statistical Inference and Confidence Intervals |
t.interval() | Calculates a confidence interval for the mean when the population standard deviation is unknown, given the sample mean, sample standard deviation, and sample size (uses the t-distribution). Note: Standard error is the standard deviation divided by the square root of the sample size. | t.interval | Statistical Inference and Confidence Intervals |
proportion_confint() | Calculates a confidence interval for a proportion (uses the normal distribution) | proportion_confint | Statistical Inference and Confidence Intervals |
ttest_1samp() | Returns the value of the test statistic and the two-tailed p-value for a one-sample hypothesis test using the t-distribution | ttest_1samp | Hypothesis Testing |
ttest_ind_from_stats() | Returns the value of the test statistic and the two-tailed p-value for a two-sample hypothesis test using the t-distribution | ttest_ind_from_stats | Hypothesis Testing |
np.array() | Creates a numerical array from a list-like object | np.array(object) | Correlation and Linear Regression Analysis |
pearsonr() | Calculates the value of the Pearson correlation coefficient r | pearsonr | Correlation and Linear Regression Analysis |
linregress() | Generates a linear regression model and provides the slope, y-intercept, and other regression-related output | linregress | Correlation and Linear Regression Analysis |
f_oneway() | Returns both the F test statistic and the p-value for the one-way ANOVA hypothesis test | f_oneway | Analysis of Variance (ANOVA) |
Time Series and Forecasting | |||
plot() | Generates a time series plot | plot(dataframe) | Introduction to Time Series Analysis |
rolling() | Provides rolling window calculations | rolling | Time Series Forecasting Methods |
mean() | Computes the average of a dataset | mean(dataset) | Time Series Forecasting Methods |
diff() | Computes the first-order difference of data in a window | diff(dataframe) | Time Series Forecasting Methods |
plot_acf() | Plots the ACF (autocorrelation function) for a time series, up to lag L | plot_acf | Time Series Forecasting Methods |
STL() | Decomposes a time series with known period P into its components | STL | Time Series Forecasting Methods |
ewm() | Performs exponential moving average (EMA) smoothing | ewm(dataframe) | Time Series Forecasting Methods |
adfuller() | Performs the Augmented Dickey-Fuller (ADF) test, a statistical test for checking the stationarity of a time series | adfuller | Time Series Forecasting Methods |
ARIMA() | Fits an ARIMA(p, d, q) (AutoRegressive Integrated Moving Average) model to time series data | ARIMA | Time Series Forecasting Methods |
Decision-Making Using Machine Learning Basics | |||
LogisticRegression() | Creates a logistic regression model | LogisticRegression() | Classification Using Machine Learning |
model.fit() | Trains a machine learning model on a given dataset | model.fit | Classification Using Machine Learning |
KMeans() | Sets up a k-means clustering model (use model.fit() to fit the model to a dataset) | KMeans(n_clusters=k) | Classification Using Machine Learning |
DBSCAN() | Sets up a DBSCAN (Density-Based Spatial Clustering of Applications with Noise) model (use model.fit() to fit the model to a dataset) | DBSCAN(options) | Classification Using Machine Learning |
confusion_matrix() | Computes a confusion matrix, which summarizes the performance of a model by comparing actual and predicted values | confusion_matrix | Classification Using Machine Learning |
LinearRegression() | Fits a linear regression model to data | LinearRegression() .fit(feature_matrix, | Machine Learning in Regression Analysis |
predict() | Generates predictions for new data points from a trained machine learning model | predict(feature_matrix) | Machine Learning in Regression Analysis |
DecisionTreeClassifier() | Sets up a decision tree model (use model.fit() to fit the model to a dataset) | DecisionTreeClassifier | Decision Trees |
ens.RandomForestRegressor() | Sets up a random forest model (use model.fit() to fit the model to a dataset) | ens.RandomForestRegressor | Other Machine Learning Techniques |
GaussianNB() | Sets up a Naïve Bayes classification model (use model.fit() to fit the model to a dataset) | GaussianNB() | Other Machine Learning Techniques |
Deep Learning and Artificial Intelligence (AI) Basics | |||
Perceptron() | Sets up a perceptron model (use model.fit() to fit the model to a dataset) | Perceptron() | Introduction to Neural Networks |
train_test_split() | Splits a dataset randomly into train and test subsets, using a proportion P of the data for the test set | train_test_split | Introduction to Neural Networks |
StandardScaler() | Standardizes features by removing the mean and scaling to unit variance | StandardScaler() | Introduction to Neural Networks |
accuracy_score() | Calculates the accuracy of a classification model as the ratio of the number of correct predictions to the total number of predictions | accuracy_score | Introduction to Neural Networks |
scaler.fit_transform() | Fits a scaler to the data and then transforms the data according to the fitted scaler | scaler.fit_transform(array) | Introduction to Neural Networks |
scaler.transform() | Applies a previously fitted scaler to new data | scaler.transform(array) | Introduction to Neural Networks |
tf.keras.Sequential() | Creates a linear stack of layers for building a neural network model | tf.keras.Sequential | Backpropagation |
model.compile() | Configures the learning process of a neural network model before training | model.compile | Backpropagation |
Visualizing Data | |||
boxplot() | Creates a box-and-whisker plot | plt.boxplot(array) | Encoding Univariate Data |
hist() | Creates a histogram | plt.hist(array) | Encoding Univariate Data |
plot() | Creates 2D line plots such as a time series graph | plt.plot | Graphing Probability Distributions |
bar() | Creates a bar chart | plt.bar | Graphing Probability Distributions |
imshow() | Displays data as an image on a 2D regular raster, such as a heatmap | plt.imshow(array) | Geospatial and Heatmap Data Visualization Using Python |
heatmap() | Creates a heatmap visualization | sns.heatmap(array) | Geospatial and Heatmap Data Visualization Using Python |
colorbar() | Adds a colorbar (a scale showing how colors map to values) to a figure | plt.colorbar() | Multivariate and Network Data Visualization Using Python |
corr() | Calculates the pairwise correlations of columns in a DataFrame | dataframe.corr() | Multivariate and Network Data Visualization Using Python |
add_subplot() | Adds a subplot to a figure stored in fig | fig.add_subplot | Multivariate and Network Data Visualization Using Python |
ax.scatter() | Creates a scatterplot | ax.scatter | Multivariate and Network Data Visualization Using Python |
Reporting Results | |||
plot_tree() | Creates a visualization of a decision tree | plot_tree | Validating Your Model |
DataFrame.info() | Provides a concise summary of a DataFrame's structure and content | DataFrame.info() | Validating Your Model |
DataFrame.drop() | Removes rows or columns from a DataFrame | DataFrame.drop | Validating Your Model |
score() | Evaluates the performance of a trained model on a given dataset | model.score | Validating Your Model |
dt.get_depth() | Retrieves the depth of the decision tree, dt | dt.get_depth() | Validating Your Model |
cross_val_score() | Evaluates a model's performance using cross-validation | cross_val_score | Validating Your Model |
GridSearchCV() | Searches for the best parameters for a specified estimator, with k-fold cross-validation | GridSearchCV | Validating Your Model |
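
Several entries in the table are built-in or standard-library functions that need no extra installation. The short sketch below shows `re.findall()`, `re.search()`, `len()`, `round()`, and `print()` working together. The sample string and the regular-expression patterns are invented for illustration and do not come from the text.

```python
import re

# Hypothetical sample text used only for this illustration
text = "Order 66 shipped on 2024-05-17; order 67 ships on 2024-06-02."

# re.findall(pattern, string): all non-overlapping matches, as a list of strings
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# len(object): the number of items in the list of matches
n_dates = len(dates)

# re.search(pattern, string): a match object for the first match, or None
has_order = re.search(r"[Oo]rder \d+", text) is not None

# round(number, digits): round to two decimal places
average = round(10 / 3, 2)

# print(x, y, ...): display several values at once
print(dates, n_dates, has_order, average)
```

The same call patterns apply to the library functions in the table once the relevant library has been imported.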