Principles of Data Science

2.4 Data Cleaning and Preprocessing


Learning Outcomes

By the end of this section, you should be able to:

  • 2.4.1 Apply methods to deal with missing data and outliers.
  • 2.4.2 Explain data standardization techniques, such as normalization, transformation, and aggregation.
  • 2.4.3 Identify sources of noise in data and apply various data preprocessing methods to reduce noise.

Data cleaning and preprocessing is an important stage in any data science task. It refers to the process of organizing and converting raw data into usable structures for further analysis. It involves removing irrelevant or duplicate data, handling missing values, and correcting errors or inconsistencies. This ensures that the data is accurate, comprehensive, and ready for analysis. Data cleaning and preprocessing typically involve the following steps:

  1. Data integration. Data integration refers to merging data from multiple sources into a single dataset.
  2. Data cleaning. In this step, data is assessed for any errors or inconsistencies, and appropriate actions are taken to correct them. This may include removing duplicate values, handling missing data, and correcting formatting inconsistencies.
  3. Data transformation. This step prepares the data for the next step by transforming the data into a format that is suitable for further analysis. This may involve converting data types, scaling or normalizing numerical data, or encoding categorical variables.
  4. Data reduction. If the dataset contains a large number of columns or features, data reduction techniques may be used to select only the most appropriate ones for analysis.
  5. Data discretization. Data discretization involves grouping continuous data into categories or ranges, which can help facilitate analysis.
  6. Data sampling. In some cases, the data may be too large to analyze in its entirety. In such cases, a sample of the data can be taken for analysis while still maintaining the overall characteristics of the original dataset.

The goal of data cleaning and preprocessing is to guarantee that the data used for analysis is accurate, consistent, and relevant. It helps to improve the quality of the results and increase the efficiency of the analysis process. A well-prepared dataset can lead to more accurate insights and better decision-making.
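
The workflow above can be sketched with a small pandas example. The following is a minimal illustration under stated assumptions, not a prescribed recipe; the file name sales.csv and all column names are hypothetical.

```python
import pandas as pd

# Hypothetical raw dataset; the file and column names are for illustration only.
df = pd.read_csv("sales.csv")              # data integration: load one merged source

# Data cleaning: drop exact duplicates and standardize an inconsistent text field
df = df.drop_duplicates()
df["region"] = df["region"].str.strip().str.lower()

# Handle missing values: fill numeric gaps with the column median
df["order_amount"] = df["order_amount"].fillna(df["order_amount"].median())

# Data transformation: convert types and scale a numeric column to [0, 1]
df["order_date"] = pd.to_datetime(df["order_date"])
amount = df["order_amount"]
df["order_amount_scaled"] = (amount - amount.min()) / (amount.max() - amount.min())

# Data reduction: keep only the columns needed for the analysis
df = df[["order_date", "region", "order_amount_scaled"]]

# Data discretization: group a continuous value into ranges
df["amount_band"] = pd.cut(df["order_amount_scaled"], bins=4,
                           labels=["low", "medium", "high", "very high"])

# Data sampling: work with a 10% random sample of the cleaned data
sample = df.sample(frac=0.1, random_state=42)
print(sample.head())
```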

Handling Missing Data and Outliers

Missing data refers to any data points or values that are not present in a dataset. This could be due to data collection errors, data corruption, or nonresponse from participants in a study. Missing data can impact the accuracy and validity of an analysis, as it reduces the sample size and potentially introduces bias.

Some specific examples of missing data include the following:

  1. A survey participant forgetting to answer a question
  2. A malfunction in data collection equipment resulting in missing values
  3. A participant choosing not to answer a question due to sensitivity or discomfort

An outlier is a data point that differs significantly from other data points in a given dataset. This can be due to human error, measurement error, or a true outlier value in the data. Outliers can skew statistical analysis and bias results, which is why it is important to identify and handle them properly before analysis.

Missing data and outliers are common problems that can affect the accuracy and reliability of results. It is important to identify and handle these issues properly to ensure the integrity of the data and the validity of the analysis. You will find more details about outliers in Measures of Center, but here we summarize the measures typically used to handle missing data and outliers in a data science project:

  1. Identify the missing data and outliers. The first stage is to identify which data points are missing or appear to be outliers. This can be done through visualization techniques, such as scatterplots, box plots, or histograms, or through statistical methods, such as calculating the mean, median, standard deviation, or IQR (interquartile range) (see Measures of Center and Measures of Variation as well as Encoding Univariate Data).

    It is important to distinguish between different types of missing data. MCAR (missing completely at random) data is missing data not related to any other variables, with no underlying cause for its absence. Consider data collection with a survey asking about driving habits. One of the demographic questions asks for income level. Some respondents accidentally skip this question, and so there is missing data for income, but this is not related to the variables being collected related to driving habits.

    MAR (missing at random) data is missing data whose absence is related to other observed variables but not to the value of the missing data itself. As an example, during data collection, a survey is sent to respondents that asks about loneliness, and one of the questions asks about memory retention. Some older respondents might skip this question since they may be unwilling to share this type of information. The likelihood of missing data for the memory question is therefore related to age (older respondents). Thus, the missing data is related to an observed variable, age, but not directly related to the memory measurements themselves.

    MNAR (missing not at random) data refers to a situation in which the absence of data depends on the unobserved data itself, that is, on the value that is missing. For example, during data collection, a survey is sent to respondents asking about debt levels. One of the questions asks about outstanding debt, such as credit card debt. Respondents with high credit card debt are less likely to answer this question. Here the missing credit card information is related to the unobserved debt levels themselves.
  2. Determine the reasons behind missing data and outliers. It is helpful to understand the reasons behind the missing data and outliers. Some expected reasons for missing data include measurement errors, human error, or data not being collected for a particular variable. Similarly, outliers can be caused by incorrect data entry, measurement errors, or genuine extreme values in the data.
  3. Determine how to solve missing data issues. Several approaches can be utilized to handle missing data. One option is to remove the records with missing data altogether, but this can lead to a loss of important information. Other methods include imputation, where the missing values are replaced with estimated values based on the remaining data, or using predictive models to fill in the missing values.
  4. Consider the influence of outliers. Outliers can greatly affect the results of the analysis, so it is important to carefully consider their impact. One approach is to delete the outliers from the dataset, but this can also lead to a loss of valuable information. Another option is to treat the outliers as a separate group and analyze them apart from the rest of the data.
  5. Use robust statistical methods. When dealing with outliers, it is important to use statistical methods that are not affected by extreme values. This includes using median instead of mean and using nonparametric tests instead of parametric tests, as explained in Statistical Inference and Confidence Intervals.
  6. Validate the results. After handling the missing data and outliers, it is important to validate the results to ensure that they are robust and accurate. This can be done through various methods, such as cross-validation or comparing the results to external data sources.

Handling missing data and outliers in a data science task requires careful consideration and appropriate methods. It is important to understand the reasons behind these issues and to carefully document the process to ensure the validity of the results.
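
As a concrete illustration of steps 1, 3, and 4 above, the following sketch flags outliers with the IQR rule and imputes missing values with the median. It assumes a hypothetical pandas DataFrame with a numeric column named value; the 1.5 × IQR cutoff is a common convention, not a universal rule.

```python
import pandas as pd

# Hypothetical data with a missing value and one extreme value
df = pd.DataFrame({"value": [12, 15, 14, None, 13, 16, 250, 14, 15]})

# Step 1: identify missing data and outliers
missing_mask = df["value"].isna()
q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (df["value"] < q1 - 1.5 * iqr) | (df["value"] > q3 + 1.5 * iqr)
print("Missing rows:", missing_mask.sum(), "Outlier rows:", outlier_mask.sum())

# Step 3: impute missing values with the median, which is robust to the outlier
median = df.loc[~outlier_mask, "value"].median()
df["value_clean"] = df["value"].fillna(median)

# Step 4: one option is to set the outliers aside and analyze them separately
outliers = df[outlier_mask]
df_main = df[~outlier_mask]
```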

Example 2.6

Problem

Starting in 1939, the United States Bureau of Labor Statistics tracked employment on a monthly basis. The number of employees in the US retail sector between 1939 and 2019 is presented in Figure 2.2.

  1. Determine if there is any outlier in the dataset that deviates significantly from the overall trend.
  2. In the event that the outlier is not a reflection of real employment numbers, how would you handle the outliers?
    A line graph illustrating the number of employees (in thousands) from 1938 to 2020, showing an overall increasing trend with fluctuations and a notable spike around October 1989. The y-axis represents the number of employees in thousands, marked in intervals of 1,000 up to 10,000. The x-axis represents the years from 1938 to 2020, marked in intervals of three years.
    Figure 2.2 US Retail Employment with Outliers
    (Data source: Bureau of Labor Statistics; credit: modification of work by Hyndman, R.J., & Athanasopoulos, G. (2021) Forecasting: principles and practice, 3rd edition, OTexts: Melbourne, Australia. OTexts.com/fpp3. Accessed on April 23, 2024)

Data Standardization, Transformation, and Validation

Data standardization, transformation, and validation are critical steps in the data analysis preprocessing pipeline. Data standardization is the process of systematically transforming collected information into a consistent and manageable format. This procedure involves the elimination of inconsistencies, errors, and duplicates as well as converting data from various sources into a unified format, often termed a normal form (defined in the next section). Data transformation involves modifying the data to make it more suitable for the analysis that is planned. Data validation ensures that the data is accurate and consistent and meets certain criteria or standards. Table 2.4 summarizes these processes, which are explored in more depth later in the chapter.

  Process Purpose Techniques Example
Normalization To eliminate redundant data and ensure data dependencies make sense Involves breaking down a larger database into smaller, more manageable tables and establishing relationships between them. A customer database table with redundant data can be normalized by splitting it into two tables, one for customer information and one for orders, and establishing the relationship between them.
Transformation To make the data more consistent, coherent, and meaningful for analysis Includes data cleaning, data merging, data splitting, data conversion, and data aggregation. A dataset with different date formats, such as “MM/DD/YYYY” and “YYYY/MM/DD,” can be transformed so that all dates use a single format.
Validation To improve the quality of data used in analysis by ensuring that the data is accurate, relevant, and consistent Includes data profiling, data audits, and data cleansing. In a survey, a respondent’s age is recorded as 150 years. Through data validation, this value can be identified as erroneous.
Table 2.4 Comparison of Data Standardization, Transformation, and Validation Processes

Data Normalization

The first step in standardizing data is to establish guidelines and rules for formatting and structuring the data. This may include setting conventions for naming, data types, and formatting. A normal form (NF) is a guideline or set of rules used in database design to ensure that a database is well-structured, organized, and free from certain types of data irregularities, such as redundancy and inconsistency. The most commonly used normal forms are 1NF, 2NF, 3NF (First, Second, and Third Normal Form), and BCNF (Boyce-Codd Normal Form).

Normalization is the process of applying these rules to a database. The data must be organized and cleaned, which involves removing duplicates and erroneous data, filling in missing values, and logically arranging the data. To uphold data standardization, regular quality control measures should be implemented, including periodic data audits to ascertain the accuracy and consistency of the data. It is also important to document the standardization process, including the guidelines and procedures followed. Periodic review and updates of data standards are necessary to ensure the ongoing reliability and relevance of the data.

Data normalization ensures that data is maintainable regardless of its source. Consider a marketing team collecting data on their customers’ purchasing behavior so they can make some decisions about product placement. The data is collected from numerous sources, such as online sales transactions, in-store purchases, and customer feedback surveys. In its raw form, this data could be disorganized and unreliable, making it difficult to analyze. It is hard to draw meaningful insights from badly organized data.

To normalize this data, the marketing team would go through multiple steps. First, they would identify the key data elements, such as customer name, product purchased, and transaction date. Then, they would ensure that these elements are consistently formatted in all data sources. For instance, they might employ the same date format across all data sources or standardize customer names into first name and last name fields. Subsequently, they would eliminate any redundant or irrelevant data elements. In this case, if the same purchase appears in both the online and in-store data, they might keep only one of the records to avoid duplication. Finally, the marketing team would ensure that the data is properly structured and organized. This could involve creating a data table with fields for each data element, such as a customer ID, product code, and purchase amount. By normalizing the data, the marketing team can efficiently track and analyze the customers' purchasing behavior, identify patterns and trends, and make data-driven decisions to enhance their marketing strategies.

The normalization formula is a statistical formula used to rescale a dataset, typically to the range between zero and one. The largest data point would have a normalized value of one, and the smallest data point would have a value of zero. Note that the presence of outliers can significantly affect the calculated minimum and maximum values. As such, it is important to first remove any outliers from the dataset before performing normalization. This ensures more accurate and representative results.

The normalization formula:

x_norm = (x − min) / (max − min)
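
A direct translation of this formula into code might look like the following minimal sketch; the function name and sample values are illustrative only.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

print(min_max_normalize([15000, 25000, 30000, 50000]))
# [0.0, 0.2857..., 0.4285..., 1.0]
```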

Example 2.7

Problem

A retail company with eight branches across the country wants to analyze its product sales to identify its top-selling items. The company collects data from each branch and stores it in Table 2.5, listing the sales and profits for each product. From previous reports, it discovered that its top-selling products are jewelry, TV accessories, beauty products, DVDs, kids' toys, video games, women's boutique apparel, and designer and fashion sunglasses. However, the company wants to arrange these products in order from highest to lowest based on best sales and profits. Determine which product is the top-selling product by normalizing the data in Table 2.5.

Branch Product Sales ($) Profits ($)
Branch 1 Jewelry 50000 20000
Branch 2 TV Accessories 25000 12500
Branch 3 Beauty Products 30000 15000
Branch 4 DVDs 15000 7500
Branch 5 Kids’ Toys 45000 22500
Branch 6 Video Games 35000 17500
Branch 7 Women’s Boutique Apparel 40000 20000
Branch 8 Designer & Fashion Sunglasses 55000 27500
Table 2.5 Retail Company Sales and Profits
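
One way to approach Example 2.7 in code is sketched below: normalize the Sales and Profits columns from Table 2.5 separately with the min-max formula and rank the products by a combined score. Averaging the two normalized columns is just one reasonable scoring choice, not the only one.

```python
import pandas as pd

# Data from Table 2.5
df = pd.DataFrame({
    "Product": ["Jewelry", "TV Accessories", "Beauty Products", "DVDs",
                "Kids' Toys", "Video Games", "Women's Boutique Apparel",
                "Designer & Fashion Sunglasses"],
    "Sales":   [50000, 25000, 30000, 15000, 45000, 35000, 40000, 55000],
    "Profits": [20000, 12500, 15000, 7500, 22500, 17500, 20000, 27500],
})

# Min-max normalize each numeric column to [0, 1]
for col in ["Sales", "Profits"]:
    df[col + "_norm"] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())

# Combine the normalized columns (simple average) and sort from highest to lowest
df["Score"] = (df["Sales_norm"] + df["Profits_norm"]) / 2
print(df.sort_values("Score", ascending=False)[["Product", "Score"]])
```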

Data Transformation

Data transformation is a statistical technique used to modify the original structure of the data in order to make it more suitable for analysis. Data transformation can involve various mathematical operations such as logarithmic, square root, or exponential transformations. One of the main reasons for transforming data is to address issues related to statistical assumptions. For example, some statistical models assume that the data is normally distributed. If the data is not distributed normally, this can lead to incorrect results and interpretations. In such cases, transforming the data can help to make it closer to a normal distribution and improve the accuracy of the analysis.

One commonly used data transformation technique is the log transformation, which involves taking the logarithm of the data values. Log transformation is often used when the data is highly skewed, meaning most of the data points fall toward one end of the distribution. This can cause problems in data analysis, as the data may not follow a normal distribution. By taking the logarithm of the values, the distribution can be shifted toward a more symmetrical shape, making it easier to analyze.

Another common transformation technique is the square root transformation, which involves taking the square root of the data values. Like the log transformation, the square root transformation is often used to address issues with skewness and make the data more normally distributed. It is also useful when the data contains values close to zero, as taking the square root of these values can bring them closer to the rest of the data and reduce the impact of extreme values. Exponential transformations involve taking the exponent of the data values. Whichever operation is used, data transformation can be a useful tool for data analysts to address issues with data distribution and improve the accuracy of their analyses.
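
The following sketch applies log and square root transformations to a skewed sample with NumPy; the generated data is synthetic and only meant to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)   # right-skewed synthetic data

log_transformed = np.log(skewed)        # log transformation (values must be > 0)
sqrt_transformed = np.sqrt(skewed)      # square root transformation (values must be >= 0)

# Skew is reduced after transformation; compare means vs. medians as a rough check
for name, data in [("raw", skewed), ("log", log_transformed), ("sqrt", sqrt_transformed)]:
    print(f"{name:>4}: mean={data.mean():.2f}, median={np.median(data):.2f}")
```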

Dealing with Noisy Data

Noisy data refers to data that contains errors, outliers, or irrelevant information that can conceal true patterns and relationships within the dataset. The presence of noisy data makes it difficult to draw accurate conclusions and make predictions from the data. Most noisy data is caused by human errors in data entry, technical errors in data collection or transmission, or natural variability in the data itself. Noisy data is cleaned by identifying and correcting errors, removing outliers, and filtering out irrelevant information. Left untreated, noisy data can negatively impact data analysis and modeling, and it may indicate that there are issues with the model's structure or assumptions.

Strategies to reduce noisy data are summarized in Table 2.7.

Strategy Example
Data cleaning Removing duplicate or irrelevant data from a dataset, such as deleting repeated rows in a spreadsheet or filtering out incomplete or error-prone data entries.
Data smoothing A technique used in data analysis to remove outliers or noise from a dataset in order to reveal underlying patterns or trends. One example is smoothing a dataset of daily stock market index values over the course of a month. The values may fluctuate greatly on a day-to-day basis, making it difficult to see any overall trend. By calculating a seven-day average, we can smooth out these fluctuations and see the overall trend of the index over the course of the month.
Imputation An example of imputation is in a hospital setting where a patient's medical history is incomplete due to missing information. The hospital staff can use imputation to estimate the missing data based on the patient's known medical conditions and past treatments.
Binning A researcher is studying the age demographics of a population in a city. Instead of looking at individual ages, the researcher decides to bin the data into age groups of 10 years (e.g., 0–10, 10–20, 20–30, etc.). This allows for a more comprehensive and easier analysis of the data.
Data transformation Consider a dataset that shows the number of COVID-19 cases recorded in a country at three points in time: 1,000 cases on 01/01/2020, 10,000 cases on 02/02/2020, and 100,000 cases on 03/01/2020. To transform this data using a log transformation, we can take the log base 10 of the number of cases, which results in transformed values of 3, 4, and 5, respectively.
Dimensionality reduction Consider a dataset of daily stock prices for 100 companies over 5 years. The original dataset has high dimensionality due to the large number of variables (100 companies) and time points (5 years). By applying principal component analysis, we can reduce the dimensionality of the dataset to a few principal components that represent the overall trends and patterns in the stock market.
Ensemble methods An example of an ensemble method is the random forest algorithm. It combines multiple decision trees, each trained on a random subset of the data, to make a more accurate prediction. This helps reduce overfitting and increase the overall performance of the model. The final prediction is made by aggregating the predictions of each individual tree.
Table 2.7 Strategies to Reduce Noisy Data
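
Two of the strategies in Table 2.7, data smoothing and binning, can be sketched with pandas as follows; the series values, the seven-day window, and the ten-year age bins follow the examples in the table and are illustrative only.

```python
import pandas as pd
import numpy as np

# Synthetic daily index values for one month (a noisy series)
rng = np.random.default_rng(1)
dates = pd.date_range("2024-01-01", periods=30, freq="D")
index_values = pd.Series(100 + np.cumsum(rng.normal(0, 2, size=30)), index=dates)

# Data smoothing: a seven-day moving average reveals the underlying trend
smoothed = index_values.rolling(window=7).mean()

# Binning: group individual ages into ten-year ranges
ages = pd.Series([3, 17, 24, 35, 41, 58, 62, 79])
age_bins = pd.cut(ages, bins=range(0, 91, 10), right=False)
print(age_bins.value_counts().sort_index())
```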

Data Validation

Data validation is the process of ensuring the accuracy and quality of data by examining it against defined rules and standards. This approach involves identifying and correcting any errors or inconsistencies in the collected data as well as ensuring that the data is relevant and reliable for analysis. Data validation can be performed through a variety of techniques, including manual checks, automated scripts, and statistical analysis. Some common inspections in data validation include checking for duplicates, checking for missing values, and verifying data against external sources or references. Before collecting data, it is important to determine the conditions or criteria that the data needs to meet to be considered valid. This can include factors such as precision, completeness, consistency, and timeliness.

For example, a company may set up a data validation process to ensure that all customer information entered into its database follows a specific format. This would involve checking for correct spellings and proper formatting of phone numbers and addresses and validating the correctness of customer names and account numbers. The data would also be checked against external sources, such as official government records, to verify the accuracy of the information. Any discrepancies or errors would be flagged for correction before the data is used for analysis or decision-making purposes. Through this data validation process, the company can ensure that its customer data is accurate, reliable, and compliant with industry standards.

Another method to assess the data is to cross-reference it with reliable sources to identify any discrepancies or errors in the collected data. A number of tools and techniques can be used to validate data. These can include statistical analysis, data sampling, data profiling, and data auditing. It is important to identify and remove outliers before validating the data. Reasonability checks involve using common sense to check if the data is logical and makes sense—for example, checking if a person's age is within a reasonable range or if a company's revenue is within a reasonable range for its industry. If possible, data should be verified with the source to ensure its accuracy. This can involve contacting the person or organization who provided the data or checking against official records. It is always a good idea to involve multiple team members or experts in the validation process to catch any errors or inconsistencies that may have been overlooked by a single person. Documentation of the validation process, including the steps taken and any issues identified, is important in future data audits or for reference purposes. Data validation is a continuous process, and data should be monitored and updated to ensure its accuracy and validity.
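
A few of these checks, such as duplicate detection, missing-value checks, and a reasonability check on age, can be expressed as simple rules. The sketch below assumes a hypothetical customer DataFrame with the column names shown.

```python
import pandas as pd

# Hypothetical customer records; column names and values are illustrative
customers = pd.DataFrame({
    "customer_id": [101, 102, 102, 104],
    "name": ["Ana Silva", "Ben Okafor", "Ben Okafor", "Caro Liu"],
    "age": [34, 150, 150, 28],          # 150 should fail the reasonability check
    "phone": ["555-0101", "555-0102", "555-0102", None],
})

issues = []
if customers.duplicated().any():
    issues.append("duplicate rows found")
if customers["phone"].isna().any():
    issues.append("missing phone numbers")
if not customers["age"].between(0, 120).all():
    issues.append("age outside reasonable range (0-120)")

print("Validation issues:", issues or "none")
```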

Consider a marketing company conducting a survey on customer satisfaction for a new product launch. The company collected data from 1,000 respondents, but when it started analyzing the data, it noticed several inconsistencies and missing values. The company's data analyst realized that the data standardization and validation processes were not adequately performed before the survey results were recorded. To correct this issue, the data analyst first identified and removed all duplicate entries, reducing the total number of responses to 900. Then, they used automated scripts to flag responses with missing values that could not be reliably filled in; these accounted for 95 responses, which were excluded. The remaining 805 responses were then checked for data accuracy using statistical analysis. After the data standardization and validation process, the company had a clean and reliable dataset of 805 responses. The results showed that the product had a satisfaction rate of 85%, which was significantly higher than the initial analysis of 78%. As a result of this correction, the marketing team was able to confidently report the actual satisfaction rate and make better-informed decisions for future product development.

Data Aggregation

Data aggregation is the process by which information from multiple sources is gathered and merged into a single dataset that provides insights and meaningful conclusions. It involves gathering, managing, and delivering data from different sources in a structured manner to facilitate analysis and decision-making. Data aggregation can be performed manually or by using automated tools and techniques, and it is used to identify patterns and trends between different data points and extract valuable insights. Some standard types of data aggregation are spatial aggregation, statistical aggregation, attribute aggregation, and temporal aggregation. This methodology is commonly used in marketing, finance, health care, and research to analyze large sets of data. Examples include calculating total sales for a company from different departments, determining the average temperature for a region spanning multiple cities, and analyzing website traffic by country. Data aggregation is also used in areas such as stock market indices, population growth, customer satisfaction scores, credit scores, and airline flight delays, and governments and utility companies use it to study energy consumption patterns.
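
A typical statistical aggregation, such as total sales per department, reduces to a groupby in pandas; the data below is made up for illustration.

```python
import pandas as pd

sales = pd.DataFrame({
    "department": ["Electronics", "Electronics", "Clothing", "Clothing", "Grocery"],
    "region":     ["North", "South", "North", "South", "North"],
    "revenue":    [12000, 9500, 7000, 6400, 15300],
})

# Attribute aggregation: total and average revenue per department
by_department = sales.groupby("department")["revenue"].agg(["sum", "mean"])

# Spatial aggregation: total revenue per region
by_region = sales.groupby("region")["revenue"].sum()

print(by_department)
print(by_region)
```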

Text Preprocessing

Text preprocessing is a technique of preparing data in text format for further analysis and natural language processing tasks. It involves transforming unstructured text data into a more structured format to be interpreted by algorithms and models. Some common techniques used in text preprocessing are reviewed in Table 2.8.

Preprocessing Technique Explanation Example
Tokenization Breaking text data into individual words or phrases (tokens) Original Text: “Tokenization is the process of breaking down a sentence, paragraph, or entire text into smaller parts called tokens.”

Tokenized Text: “Tokenization”, “is”, “the”, “process”, “of”, “breaking”, “down”, “a”, “sentence”, “,”, “paragraph”, “,”, “or”, “entire”, “text”, “into”, “smaller”, “parts”, “called”, “tokens”, “.”
Lowercasing Converting all text to lowercase to avoid multiple representations of the identical word Consider the following sentence: “John likes to eat pizza.” After lowercasing it, the sentence becomes “john likes to eat pizza.”
Stopwords removal Filtering out commonly occurring words that do not add meaning or context Consider: “The sun was shining bright in the sky and the birds were chirping. It was a lovely day in the park and people were enjoying the beautiful weather.”

After removing stopwords, the paragraph would be transformed into: “Sun shining bright sky birds chirping. Lovely day park people enjoying beautiful weather.”
Lemmatization and stemming Reducing words to their root forms to reduce complexity and improve model performance For example, the words “running,” “runs,” and “ran” would all be lemmatized to the base form ”run.”
Part-of-speech tagging Identifying the grammatical components of each word, where each word in a sentence is assigned a specific part-of-speech tag (e.g., noun, verb, or adjective) In the sentence “I went to the market to buy some fruits,” the words “went” and “buy” would be tagged as verbs, “market” and “fruits” as nouns, and “some” as an adjective.
Named entity recognition Recognizing and categorizing named entities such as people, places, and organizations Consider the following text: “John went to Paris last summer with his colleagues at Microsoft.”

By using named entity recognition, we can tag the named entities in this sentence as follows: “John (person) went to Paris (location) last summer with his colleagues at Microsoft (organization).”
Sentiment analysis Identifying and categorizing the emotions expressed in text Let’s say a customer has left a review for a new laptop they purchased. The review reads: “I am extremely satisfied with my purchase. The laptop has exceeded all of my expectations and has greatly improved my work efficiency. Thanks for an amazing product!”

To perform sentiment analysis, the text will first undergo preprocessing, which involves cleaning and preparing the text data for analysis. This may include removing punctuation, converting all letters to lowercase, and removing stopwords (commonly used words that do not add much meaning to a sentence, such as “the” or “and”).

After preprocessing, sentiment analysis techniques will be applied to analyze the emotions and opinions expressed in the review. The analysis may identify key positive words such as “satisfied” and “amazing” and measure their overall sentiment score. It may also take into account the context and tone of the review to accurately determine the sentiment.
Spell-checking and correction Correcting spelling errors to improve accuracy Suppose we have a text like: “The writting was very inpprecise and had many trgical mistakes.” With spell-checking and correction, this text can be processed and corrected to: “The writing was very imprecise and had many tragic mistakes.” This involves identifying and correcting misspelled words.

In this case, “writting” was corrected to “writing,” “inpprecise” to “imprecise,” and “trgical” to “tragic.” This not only improves the readability and accuracy of the text but also helps in better understanding and analysis of the text.
Encoding Converting text into a numerical representation for machine learning algorithms to process Let’s say we have a dataset of customer reviews for a restaurant. Each review is a string of text, such as: “I had a great experience at this restaurant, the food was delicious and the service was top-notch.” To encode this text data, we can use techniques such as one-hot encoding or word embedding.

One-hot encoding involves creating a binary vector for each word in the review, where the size of the vector is equal to the size of the vocabulary. For example, if the review contains the words “great,” “experience,” “restaurant,” “food,” “delicious,” “service,” and “top-notch,” the one-hot encoded vectors for these words would be: great: [1, 0, 0, 0, 0, 0, 0]; experience: [0, 1, 0, 0, 0, 0, 0]; restaurant: [0, 0, 1, 0, 0, 0, 0]; food: [0, 0, 0, 1, 0, 0, 0]; delicious: [0, 0, 0, 0, 1, 0, 0]; service: [0, 0, 0, 0, 0, 1, 0]; top-notch: [0, 0, 0, 0, 0, 0, 1]. These one-hot encoded vectors can now be used as input features for machine learning models.
Removing special characters and punctuation Simplifying the text for analysis Consider the input: “Hello, this is a text with @special# characters!*” and the output: “Hello this is a text with special characters”
Table 2.8 Summary of Text Preprocessing Techniques
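
Several of the techniques in Table 2.8, namely lowercasing, removing punctuation, tokenization, and stopword removal, can be chained with plain Python; the stopword list here is a deliberately small, hypothetical subset.

```python
import re

STOPWORDS = {"the", "was", "in", "and", "it", "a", "were", "is", "of", "to"}  # small illustrative set

def preprocess(text):
    text = text.lower()                          # lowercasing
    text = re.sub(r"[^a-z\s]", "", text)         # remove punctuation and special characters
    tokens = text.split()                        # simple whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]  # stopword removal

sentence = "The sun was shining bright in the sky and the birds were chirping."
print(preprocess(sentence))
# ['sun', 'shining', 'bright', 'sky', 'birds', 'chirping']
```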

Text preprocessing is crucial to effectively use text data for tasks such as text classification and information extraction. By transforming the raw text data, the accuracy and performance of machine learning models can greatly improve and provide meaningful insights and predictions.

Text preprocessing is especially important in artificial intelligence, as it can lay the foundation for effective text analysis, classification, information retrieval, sentiment analysis, and many other natural language processing tasks (see Natural Language Processing). A well-designed preprocessing pipeline can lead to better model performance, improved efficiency, and more accurate insights from text data.

An example of text preprocessing in artificial intelligence involves the conversion of text data into a more standardized and consistent format. This can include tasks such as removing accents and diacritics, expanding contractions, and converting numbers into their written representations (e.g., “10” to “ten”). Normalizing text data helps to reduce the number of variations and therefore improves the efficiency and accuracy of natural language processing tasks. It also makes the data more easily interpretable for machine learning algorithms. For example, when analyzing customer reviews, it may be beneficial to normalize text data so that variations of the same word (e.g., “colour” and “color”) are treated as the same, providing a more accurate understanding of sentiment toward a product or service.
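
A minimal sketch of this kind of normalization, removing accents and expanding a couple of contractions, could use Python's standard unicodedata module; the contraction mapping and spelling substitution below are tiny hypothetical examples, not complete lists.

```python
import unicodedata

CONTRACTIONS = {"don't": "do not", "it's": "it is"}   # tiny illustrative mapping

def normalize(text):
    # Remove accents and diacritics (e.g., "café" -> "cafe")
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase, expand a few contractions, and standardize one spelling variant
    words = [CONTRACTIONS.get(w.lower(), w.lower()) for w in text.split()]
    return " ".join(words).replace("colour", "color")

print(normalize("It's a lovely café, but I don't like the colour scheme."))
# it is a lovely cafe, but i do not like the color scheme.
```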
