Principles of Data Science

7.5 Natural Language Processing

Learning Outcomes

By the end of this section, you should be able to:

  • 7.5.1 Provide a brief history of the significant developments of natural language processing.
  • 7.5.2 Discuss the importance of speech recognition and text-to-speech algorithms in everyday life and provide some examples.
  • 7.5.3 Discuss how ChatGPT may be used effectively to speed up certain kinds of tasks while also recognizing ethical issues in the use of ChatGPT and similar technologies.

AI and deep learning have progressed rapidly in the last decade. Some of the most significant breakthroughs have come from natural language processing, or NLP, which is an area of AI concerned with recognizing written or spoken language and generating new language content. From its humble beginnings in the 1950s, marked by rudimentary language translation programs and rule-based systems, through the development of neural network models, NLP has undergone a remarkable evolution. At the time of this writing, powerful NLP platforms such as OpenAI’s ChatGPT and Microsoft’s Copilot have captured everyone’s interest with their uncanny ability to understand and respond to complex prompts, engaging users in natural and coherent conversations that simulate humanlike interactions. ChatGPT and Copilot are examples of large language models, or LLMs, a type of NLP model that is trained on massive datasets containing billions or even trillions of words. For an example of what ChatGPT looks like in action, see Figure 7.12. You can also go to https://chat.openai.com/ to use ChatGPT for free!

A screenshot of a ChatGPT conversation using the prompt: “Hi! Tell me a little bit about yourself.”
Figure 7.12 ChatGPT Conversation
OpenAI. (2024). ChatGPT (June 16 version) [Large language model]. https://chat.openai.com/chat (Created at: https://chat.openai.com/ using the prompt: “Hi! Tell me a little about yourself,” followed by the prompt: “That’s great! Could you tell me more about how you understand language?” Note: Each use of this prompt will produce varying output.)
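
For readers who want to go beyond the chat interface, the short Python sketch below shows one way to send the same kind of prompt programmatically. This is a minimal sketch under stated assumptions, not an official recipe: it assumes the openai Python package (version 1.x) is installed, an OPENAI_API_KEY environment variable is set, and the model name gpt-4o-mini is an illustrative choice (any chat-capable model would work).

    # pip install openai   (assumes an OPENAI_API_KEY environment variable)
    from openai import OpenAI

    client = OpenAI()  # reads the API key from the environment
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice; substitute as needed
        messages=[{"role": "user",
                   "content": "Hi! Tell me a little bit about yourself."}],
    )
    print(response.choices[0].message.content)  # the model's reply as text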

NLP models may act as force multipliers, enhancing human capabilities by automating routine tasks and augmenting human intelligence. For example, they can assist in preparing form letters, summarizing information, analyzing sentiment in large volumes of text, extracting key insights from documents, translating from one language to another, and even generating creative content such as articles, stories, or poetry. You may be wondering whether NLP was used to help write this text. While almost every aspect of the writing was indeed handled by humans, ChatGPT has often been helpful in brainstorming ideas, finding out basic information, and providing a starting point for some of the content writing. You are currently experiencing the future of AI, where more and more of the tasks that humans consider mundane or routine can be offloaded to computers!
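
As a concrete illustration of one of these routine tasks, the following minimal Python sketch performs sentiment analysis with the Hugging Face transformers library (our choice for illustration; it is not the only option). It assumes the transformers package and a backend such as PyTorch are installed; the first call downloads a small pretrained model.

    # pip install transformers torch
    from transformers import pipeline

    # Downloads a default pretrained sentiment model on first use.
    classifier = pipeline("sentiment-analysis")
    result = classifier("This chapter makes NLP surprisingly approachable.")
    print(result)  # e.g., [{'label': 'POSITIVE', 'score': 0.99...}]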

A Brief History of NLP

Key milestones, from the creation of ELIZA (an early “chatbot,” or conversational agent—see Figure 7.13) in the 1960s to the advent of statistical language models in the 1990s, have paved the way for the emergence of the so-called transformer models and state-of-the-art language models like ChatGPT in the 2010s and beyond. Transformer models are a type of neural network architecture that uses a very sophisticated internal memory system to capture long-range and large-scale dependencies in the input data. Today, NLP models power a diverse array of real-world applications.

A screenshot of Eliza with a conversation that starts with, “Hello, I am Eliza” and discusses being depressed or happy.
Figure 7.13 A Screenshot of Eliza (credit: "ELIZA mostly restates content entered by the user to elicit further information. This tactic can help people feel like they're having a meaningful conversation." by Rosenfeld Media/Flickr, CC BY 2.0)

Here is a timeline of significant milestones in the development of NLP. This topic is too vast to cover fully in this text, so we will only mention some of the important models without going into detail. Table 7.3 highlights several key developments, and a short code sketch of self-attention, the central idea behind transformer models, follows the table.

Year Development
Mid-1960s ELIZA, a computer program capable of simulating conversation, was created by Joseph Weizenbaum at MIT.
Late 1980s to 1990s Statistical language models such as hidden Markov models (HMMs; a type of statistical model useful for predicting sequential data, such as speech and handwriting) and n-gram models (which predict the next, or nth, term of a sequence from the previous n − 1 terms) gained popularity for tasks like speech recognition and machine translation.
2001 The BLEU metric for evaluating machine translation systems was introduced by Kishore Papineni and colleagues.
2007 Moses, an open-source statistical machine translation toolkit, was introduced by Philipp Koehn and colleagues.
2013 Word2vec, a word embedding technique, was introduced by Tomas Mikolov and colleagues at Google. Word embedding refers to translating words into vector representations in high-dimensional vector spaces in such a way that preserves certain relationships among the words.
2017 The Transformer architecture, introduced in the paper “Attention Is All You Need” by Ashish Vaswani et al., revolutionized the world of NLPs. Instead of considering words one by one, a transformer model looks at sentences, paragraphs, and larger groups of words using a paradigm called self-attention, which allows the model to evaluate the importance of different words or phrases within a larger context.
2018 OpenAI introduced the first GPT (Generative Pretrained Transformer), a large-scale unsupervised language model.
2018 BERT (Bidirectional Encoder Representations from Transformers), a transformer-based language model, was introduced by researchers at Google.
2020 OpenAI released GPT-3 (Generative Pretrained Transformer 3), an even larger and more powerful language model, with up to 175 billion parameters.
2022 ChatGPT, a variant of GPT-3 fine-tuned for conversational tasks, was released, gaining widespread attention for its impressive performance in generating humanlike responses.
2023 Enhancements focused on improving LLM efficiency, accuracy, and scalability. Release of GPT-4, which incorporates more training with human feedback, continuous improvement from real-world data, and more robust quality control.
Table 7.3 Brief History of NLP
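
To make the 2017 row more concrete, here is a minimal NumPy sketch of scaled dot-product self-attention, the core computation inside a transformer. The toy embeddings and random projection matrices are stand-ins for the values a real model would learn during training.

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(4, 8))  # toy "sentence": 4 tokens, 8-dim embeddings

    # Random stand-ins for the learned query/key/value projections.
    Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv

    scores = Q @ K.T / np.sqrt(K.shape[-1])  # relevance of each token to each other
    weights = softmax(scores, axis=-1)       # each row sums to 1
    output = weights @ V                     # context-aware token representations
    print(weights.round(2))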

While most of the development has been in areas of text recognition and generation, significant innovations have also been made in areas of art, music, video editing, and even computer coding.

NLPs in Generative Art

Generative art is the use of AI tools to enhance or create new artistic and graphic works. In the past, people thought that one defining feature setting humans apart from animals, or even from computers, was our ability to make and appreciate art. Surveying the artistic genius of Leonardo da Vinci, Vincent van Gogh, Pablo Picasso, Salvador Dalí, Frida Kahlo, and Jean-Michel Basquiat, it is difficult to imagine a machine producing anything like the masterworks created by human artists. However, we are beginning to witness a revolution in art, led by artificial intelligence! All judgments of quality are left up to the reader, but here are a few tools available now for generative art.

OpenArt: OpenArt utilizes neural style transfer techniques to transform photos into artworks inspired by famous artists’ styles, and it can also generate new images and animations based on text prompts.

DALL-E and Craiyon (formerly DALL-E mini): DALL-E, developed by OpenAI, and Craiyon, an independent model inspired by it, generate images from textual descriptions, demonstrating the ability to create diverse and imaginative visual content. Figure 7.14 shows some examples of artistic works created by Craiyon.

A grid of nine images created by artificial intelligence of a horse drinking a cocktail. Each is in a slightly different style.
Figure 7.14 AI Art Created by Craiyon. (These nine images were generated using Craiyon (https://www.craiyon.com/) from the prompt “A surreal painting of a horse sitting at a bar, drinking a cherry milkshake.” Note: Each use of this prompt will produce varying output.)

Magenta Project: Magenta is an open-source research project by Google that explores the intersection of AI and creativity, including music and art generation. It offers various tools and models for generating music and visual art using deep learning techniques, such as neural style transfer and image generation models.

Music is another area of artistic expression that was until recently regarded as solely in the domain of human minds. Nowadays, there are numerous AI tools for processing, analyzing, and creating music.

Jukebox: An OpenAI research project, Jukebox uses deep neural networks to generate raw audio music, including rudimentary singing, in a wide range of genres and artist styles.

Lyria model by Google DeepMind: Lyria is described by Google DeepMind as its most advanced AI music generation model to date. It excels at generating high-quality music with both instrumentals and vocals, tackling challenges such as maintaining musical continuity across phrases and handling multiple voices and instruments simultaneously.

Amadeus Code: Amadeus Code is an AI-powered music composition platform that uses deep learning algorithms to generate melodies and chord progressions. It allows users to input musical ideas and preferences and generates original compositions based on their input.

Video Editing and Captioning

A number of AI-assisted technologies can help with editing and captioning videos.

Descript: Descript integrates NLP-based transcription and editing tools to facilitate easy editing of audio and video content through text-based commands and manipulation.

Kapwing: Kapwing utilizes NLP for automatic video captioning, enabling users to quickly generate accurate captions for their videos.

Simon Says: Simon Says is an AI-powered transcription and translation platform that uses NLP to transcribe audio and video files in multiple languages. It offers features such as speaker identification, timecode alignment, and captioning, helping video editors streamline the post-production workflow.

Computer Coding

At the time of this writing, it is common knowledge among researchers and programmers that NLPs like ChatGPT can generate code in any common coding language based on user prompts. Moreover, online coding platforms such as Google Colab (which is the platform used throughout this text for Python code examples) offer AI assistance in creating and debugging code. Here are a few more resources that use AI to produce computer code.

GitHub Copilot: Not to be confused with Microsoft Copilot, GitHub Copilot was developed by GitHub in collaboration with OpenAI to be an AI-powered code completion tool that uses NLP to suggest code snippets and provide context-aware code suggestions directly within integrated development environments (IDEs) like Visual Studio Code.

TabNine: TabNine is an AI-powered code completion tool that uses a combination of deep learning and NLP to provide intelligent code suggestions as developers type. It supports various programming languages and integrates with popular code editors.

Sonar: Sonar is an AI-powered static code analysis platform that uses machine learning and NLP techniques to identify potential bugs, security vulnerabilities, and performance issues in code. It provides intelligent suggestions for improving and cleaning up code and for following programming best practices.

CoNaLa: CoNaLa (Code/Natural Language) is a dataset and research project that aims to bridge the gap between natural language descriptions and code snippets. It uses NLP techniques to generate code snippets from natural language descriptions of programming tasks, helping automate code synthesis.

Speech Recognition and Text-to-Speech

NLP models play a crucial role in both speech recognition and text-to-speech (TTS) applications, enabling computers to understand and generate human speech.

In speech recognition, NLP models process spoken language and convert it into text. This involves several steps (a brief code sketch follows the list):

  1. Acoustic Analysis: The speech signal is converted into a sequence of feature vectors using deep learning techniques.
  2. Language Modeling: Further processing allows the NLP model to analyze and understand the linguistic structure of the spoken words, considering factors like grammar, vocabulary, and context. This helps in recognizing words and phrases accurately, even in noisy or ambiguous environments.
  3. Decoding: Once the speech signal has been analyzed acoustically and linguistically, decoding algorithms are used to match the observed features with the most likely sequence of words or phrases.
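
The following minimal Python sketch shows these stages from a user's point of view, using the open-source SpeechRecognition package (our assumption: the package must be installed, the file name sample.wav is hypothetical, and the default recognizer sends the audio to Google's free web API, so internet access is required).

    # pip install SpeechRecognition
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile("sample.wav") as source:  # hypothetical audio file
        audio = recognizer.record(source)       # capture the raw signal

    try:
        # The service performs the acoustic analysis, language modeling,
        # and decoding steps described above and returns a transcript.
        text = recognizer.recognize_google(audio)
        print("Transcript:", text)
    except sr.UnknownValueError:
        print("Speech was unintelligible.")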

In text-to-speech (TTS) applications, NLP models perform the reverse process: they convert text into spoken language. This involves the following steps (again, a brief sketch follows the list):

  1. Text Analysis: Using sophisticated deep learning models, the input text is analyzed and processed to extract linguistic features such as words, sentences, and punctuation.
  2. Prosody Modeling: To avoid sounding “like a robot,” NLP models consider linguistic features like stress, intonation, and rhythm to generate natural-sounding speech.
  3. Speech Synthesis: The synthesized speech is generated using digital signal processing techniques to produce sound waves that closely resemble human speech. This can involve concatenating prerecorded speech segments, generating speech from scratch using parametric synthesis, or using neural network–based generative models.
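
Here is a comparable sketch of the reverse direction, using the pyttsx3 package (again an assumption: it must be installed, and it relies on the operating system's built-in offline voices).

    # pip install pyttsx3
    import pyttsx3

    engine = pyttsx3.init()
    engine.setProperty("rate", 150)    # prosody: speaking rate (words per minute)
    engine.setProperty("volume", 0.9)  # prosody: loudness, from 0.0 to 1.0

    # Text analysis and waveform synthesis happen inside the engine.
    engine.say("Text to speech converts written words into spoken audio.")
    engine.runAndWait()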

Future Directions for NLP and LLMs

Where will natural language processing and large language models be in the future? Given the rapid pace of development over the past few decades, it is hard to predict. Some of the more recent tools such as ChatGPT and Microsoft Copilot are already making a huge impact in the workplace, schools, and elsewhere.

A number of programs now have AI embedded in them, a feature that is expected to become more common as AI technology advances. AI has also been integrated into search engines and browsers such as Bing and Microsoft Edge. Both Microsoft and Google have introduced AI into their applications. For example, Copilot will provide enhanced abilities to summarize documents, add animations to slides in PowerPoint, and create stories in Word. It is being marketed as a tool to improve employee efficiency by automating time-consuming tasks such as composing or reading long emails or summarizing datasets in Excel.

In a comparable way, Google has been testing the integration of AI into its Workspace applications. Google’s tool, called Gemini, is designed to do many of the same things as Microsoft’s Copilot. The program uses machine learning to generate emails, draft documents, and summarize data trends, among many other functions. Keep in mind that as AI continues to evolve, governments will seek to regulate its use and application.

For both Microsoft and Google, the goal is to use AI technology to enhance worker productivity. Adding automation to routine, time-consuming tasks can free up employees to focus on more pressing and impactful tasks that will contribute to their companies in a greater way. We can expect to see more AI in these programs and others with time. There will certainly be advances in deep learning models, leading to better image recognition and analysis; more realistic content generation, including art, music, speech, etc.; and more sophisticated interaction with users via natural language models. There are already news channels that are completely AI-generated!

Moreover, NLP applications will have a huge impact on education. Colleges and universities have already been considering what to do about tools such as ChatGPT in the classroom. In much the same way that pocket calculators revolutionized the mathematics curriculum, pushing educators to steer away from teaching tedious calculations by hand in favor of more conceptual knowledge and problem-solving skills, large language models like ChatGPT will force educators to rethink the purpose of writing papers. Will students in a literature class use an NLP model to summarize the plot, theme, and tropes of Shakespeare’s Hamlet without the need to even read the play? If so, then educators must find alternative ways to engage their students. Perhaps the assignment will morph into using an NLP to create a contemporary music video based on the mysterious death of Ophelia (a character in Hamlet); a thorough understanding of the play would be essential for the student to be able to evaluate whether AI did a good job or not.

AI could also be used to train student teachers. Imagine interacting with an NLP model that simulates a sixth-grader with attention-deficit/hyperactivity disorder (ADHD). The student teacher would gain valuable experience working with different personality types even before entering a real classroom.

We are already seeing AI tools being used as advanced tutors in science, mathematics, computer science, and engineering. For example, Socratic (by Google) is a mobile app that provides step-by-step explanations and solutions to homework problems in various subjects, leveraging powerful AI models. The use of AI in both helping students solve problems and generating educational content (including training future data scientists, who will use AI themselves!) can offer several advantages. Natural language processing models can efficiently generate high-quality content, covering a wide range of topics and providing explanations that are clear and accessible to learners. Think of AI as a force multiplier, freeing the human authors and educators to focus on the big picture. However, it's essential to recognize that while NLPs can generate content, they are merely tools created by humans. Therefore, the responsibility lies with educators and content creators to ensure that the generated material is accurate, interesting, up-to-date, and aligned with the learning objectives of the course or text.

ChatGPT and Ethical Considerations of Using NLP

In the preceding sections, we introduced the powerful natural language processing model ChatGPT. While ethical issues must be considered throughout the entire data science process (and are discussed in depth in Ethics Throughout the Data Science Cycle), NLPs and large language models themselves present some unique ethical concerns.

As noted earlier, the current version of ChatGPT is primarily based on the transformer model, a deep learning architecture that incorporates a self-attention mechanism. Whereas simpler models like recurrent neural networks and long short-term memory networks have rudimentary memory structures, the transformer model utilizes a very sophisticated internal memory system that can capture long-range and large-scale dependencies in the input data. These models are trained on vast amounts of text data using unsupervised learning techniques, allowing them to generate humanlike text and perform various natural language processing tasks.

The main components of NLP models are as follows (a short sketch after the list illustrates the first two):

  • Tokenization: The process of breaking down a piece of text into smaller units, such as words or subwords, so that the structure and meaning of the input text can be analyzed.
  • Word embeddings: Words are converted into numerical vectors called word embeddings. These vectors represent the meaning of words in a high-dimensional space, allowing a deep understanding of semantic relationships between words.
  • Contextual understanding: The meaning of individual words and phrases is analyzed within the larger context of a whole conversation rather than relying on only the immediately surrounding words.
  • Feedback loop: As users interact with an LLM and give feedback on its responses, the model can continually learn and improve over time.
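
A brief Python sketch of the first two components, using the Hugging Face transformers tokenizer for BERT (an assumed toolchain: the package must be installed, and the vocabulary is downloaded on first use):

    # pip install transformers
    from transformers import AutoTokenizer

    # Tokenization: split text into subword units from BERT's vocabulary.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    pieces = tokenizer.tokenize("Transformers capture long-range context.")
    print(pieces)  # e.g., ['transformers', 'capture', 'long', '-', 'range', ...]

    # Word embeddings: each ID indexes a row of the model's embedding matrix.
    ids = tokenizer.convert_tokens_to_ids(pieces)
    print(ids)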

Protections for Artists and Their Intellectual Property

Advanced NLP models such as ChatGPT, DALL-E, Lyria, and the many others currently available present both remarkable opportunities and serious ethical issues. The use of these models raises concerns about appropriate protections for human artists and their creations. One major concern is how the models are trained. Essentially, anything that is available on the internet could be used as input to train an NLP model. This includes original text, art, and music that humans have created. Whether the original content is copyrighted or not, an NLP model may have access to it and be trained to mimic the real thing.

With the ability to produce vast amounts of content quickly, there is a huge risk of devaluing the work of human creators and perpetuating issues related to copyright infringement and intellectual property rights. As such, it’s imperative to establish clear guidelines and regulations to safeguard the rights of human artists and ensure fair compensation for their work. Additionally, the deployment of ChatGPT and similar models in content creation should be accompanied by transparent disclosure and attribution practices to distinguish between AI-generated content and human-authored works.

The film industry especially has been both transformed and disrupted by generative AI, which offers new opportunities for innovative filmmaking along with major concerns about copyright infringement, job displacement, and breaches of data privacy. In the summer of 2023, both the Screen Actors Guild-American Federation of Television and Radio Artists (SAG-AFTRA) and the Writers Guild went on strike against the Alliance of Motion Picture and Television Producers (AMPTP), which includes the five big U.S. movie studios plus Amazon, Netflix, and Apple. The strikes, lasting four and five months, respectively, centered on two AI-related concerns: the threat to writers’ livelihoods if studios were to replace writing roles with generative AI (GAI), and the protection of actors’ and extras’ identities and “likenesses” from being digitally re-created without their consent. The landmark agreements negotiated between unions and studios established what one observer termed “the most progressive AI protections in industry history” (Luna & Draper, 2023).

Disclosure and Attribution

The deployment of ChatGPT and similar AI models in content creation necessitates transparent disclosure and attribution practices to ensure clarity and accountability regarding the origin of generated content. Transparent disclosure involves clearly indicating when content has been generated by an AI model rather than authored by a human. This disclosure helps users understand the nature of the content they are interacting with and manage their expectations accordingly.

Attribution practices involve giving credit to the appropriate sources or creators of content, whether human or AI-generated. In the context of ChatGPT, attribution practices may involve indicating the use of AI assistance in content creation and acknowledging the contributions of the underlying model. This helps maintain transparency and integrity in content creation processes as well as respect for the efforts of human creators.

Currently, disclosure and attribution practices in ChatGPT and similar models vary depending on the specific use case and platform where the model is deployed. Some platforms may include disclaimers or labels indicating that content has been generated by an AI model, while others may not provide explicit disclosure. Additionally, attributing AI-generated content to specific models or algorithms may be challenging, as the content generation process often involves complex interactions between multiple components and datasets.

Moving forward, there is a need for standardized guidelines and best practices regarding disclosure and attribution in AI-generated content. This includes establishing clear standards for labeling AI-generated content, defining appropriate attribution mechanisms, and promoting transparency and accountability in AI-based content creation. This topic is discussed further in Ethics Throughout the Data Science Cycle.

Limitations of NLPs

Take a look at the “photo” of a musician playing an upright bass in Figure 7.15. Do you notice anything strange? First of all, the bass seems to be floating in front of the woman (or is she balancing it by some sort of handle sticking out of the neck of the instrument?). There are too few strings (there should be four). Even the machine heads (the tuners on the headstock, at the very top of the instrument) are a bit off, with apparently three of them showing on one side and with missing or malformed keys. Another physiological detail that AI-generated images often fail to depict accurately is hands, which frequently appear with missing or misshapen fingers (as in the musician’s left hand in Figure 7.15).

Image of a woman playing a bass that was created by artificial intelligence. There are numerous details that do not make sense such as a misshapen hand, the number of strings, and the lack of a bow to play the instrument.
Figure 7.15 Image of a Musician Created by OpenArt. While the picture resembles a real photo at first, there are numerous details that do not make sense. (Image generated using OpenArt [link to: https://openart.ai/create] from the prompt “A woman playing upright bass.” Note: Each use of this prompt will produce varying output.)

These kinds of problems with detail are very common in AI-generated material. There are various reasons for this, including lack of adequate training data, inherent bias, lack of diversity in training, complexity of the objects in the image, and the AI model’s prioritization of general style and composition over attention to detail. This is sometimes called the “fingers problem,” as AI has trouble getting the right number of fingers on a human hand (though, in Figure 7.15, there seems to be the correct number of fingers visible).

Another issue that plagues LLMs is that of hallucinations, which are generated responses that have no factual basis. AI hallucinations are essentially made-up responses that seem to answer the prompt but are incorrect, misleading, or lacking proper context. Since everything that an LLM produces is, in a sense, “made up,” the LLM cannot easily detect when it is giving a hallucination. Imagine asking a five-year-old how a car runs. She might tell you that there’s a loud troll that lives under the hood turning the wheels by hand, drinking the gasoline that her parents continually feed it. In the absence of a true explanation, it sounds plausible to the child.

Finally, it should be mentioned that any AI model is only as good as the training data that is fed into it. Despite remarkable advancements in NLP models and LLMs, a major concern is the potential for bias and unfairness due to unforeseen problems in their training data. AI systems, including ChatGPT, can inadvertently perpetuate stereotypes or amplify existing societal biases present in the training data. Moreover, NLP models like ChatGPT may struggle with understanding and generating contextually appropriate responses, leading to instances of misinformation or inappropriate content generation.

There are also concerns about overreliance on LLMs in content generation, potentially leading to a lack of diversity in perspectives or overlooking nuances in the subject matter. In some sense, a fully trained LLM produces an average, taking in the wonderful, rare, creative, surprising, and eccentric creations of humans and mixing all that in with the vast amount of the routine, mundane input data to come up with a plain vanilla response.

Exploring Further

Bias and Ethics in AI

The effects of bias in deep learning and AI models are critical to understand, especially in important areas like hiring, finance, and law enforcement. Bias in AI can lead to unfair outcomes, perpetuating existing social inequalities. The article “Using Python to Mitigate Bias and Discrimination in Machine Learning Models” provides a practical guide for addressing these issues. It demonstrates how Python libraries such as HolisticAI, Scikit-Learn, and Matplotlib can be used to identify and mitigate bias in machine learning models. This process involves assessing model performance across different demographic groups and applying techniques to ensure fair treatment and equitable outcomes.

The article “The Ethical AI Libraries That Are Critical for Every Data Scientist to Know” discusses various other libraries designed to enhance ethical practices in AI development, such as IBM’s AI Fairness 360, Google’s What-If Tool, and Microsoft’s Fairlearn. Each tool offers unique features to identify, assess, and mitigate biases in datasets and models. A brief example using Fairlearn follows.
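
As a small taste of what such libraries do, the sketch below uses Fairlearn's MetricFrame to compare a model's accuracy across two demographic groups. It is a minimal sketch with made-up toy data, and it assumes the fairlearn and scikit-learn packages are installed.

    # pip install fairlearn scikit-learn
    import numpy as np
    from sklearn.metrics import accuracy_score
    from fairlearn.metrics import MetricFrame

    # Toy labels and predictions for eight people in two groups, A and B.
    y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
    y_pred = np.array([1, 0, 0, 1, 0, 0, 0, 1])
    group = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

    # MetricFrame slices the metric by group to expose performance gaps.
    mf = MetricFrame(metrics=accuracy_score,
                     y_true=y_true, y_pred=y_pred,
                     sensitive_features=group)
    print(mf.by_group)      # accuracy for group A vs. group B
    print(mf.difference())  # largest between-group accuracy gap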

Malicious Uses of AI

User privacy, data security, potential copyright infringement, and devaluation of human-created content are just the tip of the iceberg; hallucinations, bias, and lack of diversity in training data often lead to inaccuracies. Much more concerning than these unintentional inaccuracies are the malicious uses of AI to deceive people.

Deepfakes are products of AI systems that seem realistic and are intended to mislead people. For example, in January 2024, a number of residents of New Hampshire received a robocall (an automatic recorded message by phone) that sounded like President Biden discouraging voters from voting in the Democratic presidential primary election. It was later discovered that the recording had been created with AI by a person associated with one of Biden’s rivals in the primary race.

Deepfake videos have been used to create fake celebrity endorsements, political propaganda, and illicit pictures that look realistic, highlighting the need for legal and regulatory frameworks to address the misuse of AI-generated content.

In order to address concerns about the malicious use of AI, tech companies and organizations now promote the ideal of responsible AI, which refers to the ethical and socially conscious development and deployment of artificial intelligence systems. It encompasses a set of principles, practices, and guidelines aimed at ensuring that AI technologies are developed and used in ways that are fair, transparent, accountable, and beneficial to society.

Key aspects of responsible AI include ethical considerations, transparency, accountability, fairness and bias mitigation, data security, societal and environmental impacts, and design principles that are human-centric, topics that we will explore further in Ethics Throughout the Data Science Cycle. The main goal of responsible AI is to foster the development and deployment of AI technologies that align with human values, respect human rights, and contribute to a more equitable and sustainable future.
