Principles of Data Science

2.3 Web Scraping and Social Media Data Collection


Learning Outcomes:

By the end of this section, you should be able to:

  • 2.3.1 Discuss the uses of web scraping for collecting and preparing data for analysis.
  • 2.3.2 Apply regular expressions for data manipulation and pattern matching.
  • 2.3.3 Write Python code to scrape data from the web.
  • 2.3.4 Apply various methods for parsing, extracting, processing, and storing data.

Web scraping and social media data collection are two approaches used to gather data from the internet. Web scraping involves pulling information and data from websites using a web data extraction tool, often known as a web scraper. One example would be a travel company looking to gather information about hotel prices and availability from different booking websites. Web scraping can be used to automatically gather this data from the various websites and create a comprehensive list for the company to use in its business strategy without the need for manual work.

Social media data collection involves gathering information from platforms like Twitter and Instagram using application programming interfaces (APIs) or monitoring tools. An application programming interface (API) is a set of protocols, tools, and definitions for building software applications. APIs allow different software systems to communicate and interact with each other, enabling developers to access data and services from other applications, operating systems, or platforms. Both web scraping and social media data collection require determining the data to be collected and analyzing it for accuracy and relevance.

Web Scraping

There are several techniques and approaches for scraping data from websites. See Table 2.2 for some of the common techniques used. (Note: The techniques used for web scraping will vary depending on the website and the type of data being collected. It may require a combination of different techniques to effectively scrape data from a website.)

Web Scraping Technique Details
Web Crawling
  • Follows links on a web page to navigate to other pages and collect data from them
  • Useful for scraping data from multiple pages of a website
XPath
  • Powerful query language
  • Navigates through the elements in an HTML document
  • Often used in combination with HTML parsing to select specific elements to scrape
Regular Expressions
  • Search for and extract specific patterns of text from a web page
  • Useful for scraping data that follows a particular format, such as dates, phone numbers, or email addresses
HTML Parsing
  • Analyzes the HTML (HyperText Markup Language) structure of a web page and identifies the specific tags and elements that contain the desired data
  • Often used for simple scraping tasks
Application Programming Interfaces (APIs)
  • Allow developers to access and retrieve data directly, without the need to scrape pages
  • Often a more efficient and reliable method for data collection
XML API Subset
  • XML (Extensible Markup Language) is another markup language used for exchanging data
  • This method works similarly to the JSON API subset below: the program makes HTTP requests to the website's API endpoints and then parses the data received in XML format
JSON API Subset
  • JSON (JavaScript Object Notation) is a lightweight data interchange format that is commonly used for sending and receiving data between servers and web applications
  • Many websites provide APIs in the form of JSON, making it another efficient method for scraping data
Table 2.2 Techniques and Approaches for Scraping Data from Websites
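
To make the API-based techniques in Table 2.2 concrete, the following minimal sketch requests data from a JSON API endpoint with the third-party requests library and parses the response. The URL here is a placeholder, not a real service.

Python Code

    import requests  # third-party HTTP library

    # Placeholder API endpoint; substitute a real one for actual use
    url = "https://api.example.com/hotels"

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop if the request failed

    data = response.json()  # parse the JSON payload into Python objects
    print(data)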

Social Media Data Collection

Social media data collection can be carried out through various methods such as API integration, social listening, social media surveys, network analysis, and image and video analysis. APIs provided by social media platforms allow data scientists to collect structured data on user interactions and content. Social listening involves monitoring online conversations for insights on customer behavior and trends. Surveys conducted on social media can provide information on customer preferences and opinions. Network analysis, or the examination of relationships and connections between users, data, or entities within a network, can reveal influential users and communities. It involves identifying and analyzing influential individuals or groups as well as understanding patterns and trends within the network. Image and video analysis can provide insights into visual trends and user behavior.

An example of social media data collection is conducting a Twitter survey on customer satisfaction for a food delivery company. Data scientists can use Twitter's API to collect tweets containing specific hashtags related to the company and analyze them to understand customers' opinions and preferences. They can also use social listening to monitor conversations and identify trends in customer behavior. Additionally, creating a social media survey on Twitter can provide more targeted insights into customer satisfaction and preferences. This data can then be analyzed using data science techniques to identify key areas for improvement and drive informed business decisions.
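
As a rough sketch of this kind of collection (not the company's actual workflow), the following code uses the third-party tweepy library to search recent tweets for a made-up hashtag. The bearer token and hashtag are placeholders, and access to the Twitter/X API depends on the developer tier available.

Python Code

    import tweepy  # third-party client for the Twitter/X API

    # Placeholder credential; a real bearer token is required
    client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

    # Search recent tweets containing a made-up hashtag
    response = client.search_recent_tweets(query="#FoodDeliveryCo", max_results=10)

    for tweet in response.data or []:
        print(tweet.text)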

Using Python to Scrape Data from the Web

As noted previously, web scraping is a strategy for gathering data from the internet using automated mechanisms or programs. Python is one of the most popular programming languages for web scraping because its many libraries and frameworks make it easy to pull and process data from websites.

To scrape data such as a table from a website using Python, we follow these steps:

  1. Import the pandas library. The first step is to import the pandas library, which is a popular Python library for data analysis and manipulation.
import pandas as pd
  2. Use the read_html() function. This function reads HTML tables from a web page and converts them into a list of DataFrame objects. Recall from What Are Data and Data Science? that a DataFrame is a data type that pandas uses to store multi-column tabular data.
df_list = pd.read_html("https://...")
  3. Access the desired data. If the data on the web page is divided into different tables, we need to specify which table we want to extract. We use indexing to access the desired table (for example, index 4) from the list of DataFrame objects returned by the read_html() function. The index represents the order of the table on the web page.
  4. Store the data in a DataFrame. The result of the read_html() function is a list of DataFrame objects, and each DataFrame represents a table from the web page. We can store the desired table in a DataFrame variable (for example, df = df_list[4]) for further analysis and manipulation.
  5. Display the DataFrame. By accessing the DataFrame variable, we can see the extracted data in a tabular format.
  6. Convert strings to numbers. As noted in Chapter 1, a string is a data type used to represent a sequence of characters, such as letters, numbers, and symbols, enclosed by matching single (') or double (") quotes. If the data in the table is in string format and we want to perform numerical operations on it, we need to convert it to numerical format. We can use the to_numeric() function from pandas to convert strings to numbers and store the result in a new column in the DataFrame.
df['column_name'] = pd.to_numeric(df['column_name'])

This will create a new column in the DataFrame with the converted numerical values, which can then be used for further analysis or visualization.
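
Putting these steps together, a minimal sketch might look like the following. The URL, table index, and column name are placeholders; read_html() also requires an HTML parser such as lxml to be installed.

Python Code

    import pandas as pd

    # Placeholder URL of a page containing HTML tables
    url = "https://example.com/page-with-tables"

    # read_html() returns a list of DataFrames, one per table on the page
    tables = pd.read_html(url)

    # Index 4 selects the fifth table on the page (indexing starts at 0)
    df = tables[4]
    print(df.head())  # display the first rows of the extracted table

    # Convert a string column (placeholder name) to numbers
    df['column_name'] = pd.to_numeric(df['column_name'])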

In computer programming, indexing usually starts from 0. This is because most programming languages use 0 as the initial index for arrays, matrices, or other data structures. This convention has been adopted to simplify the implementation of some algorithms and to make it easier for programmers to access and manipulate data. Additionally, it aligns with the way computers store and access data in memory. In the context of parsing tables from HTML pages, using 0 as the initial index allows programmers to easily access and manipulate data from different tables on the same web page. This enables efficient data processing and analysis, making the task more manageable and streamlined.

Example 2.4

Problem

Extract data table "Current Population Survey: Household Data: (Table A-13). Employed and unemployed persons by occupation, Not seasonally adjusted" from the FRED (Federal Reserve Economic Data) website in the link (https://fred.stlouisfed.org/release/tables?rid=50&eid=3149#snid=4498) using Python code. The data in this table provides a representation of the overall employment and unemployment situation in the United States. The table is organized into two main sections: employed persons and unemployed persons.
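
One possible solution sketch follows; it passes the URL from the problem statement to read_html(). The position of Table A-13 in the returned list is an assumption to verify by inspecting the list, and the approach works only if the page serves the table as static HTML.

Python Code

    import pandas as pd

    # URL of the FRED release page given in the problem statement
    url = "https://fred.stlouisfed.org/release/tables?rid=50&eid=3149#snid=4498"

    tables = pd.read_html(url)  # every table on the page, as DataFrames
    print(len(tables))          # how many tables were found

    # The index of Table A-13 is an assumption; inspect the candidates to confirm
    df = tables[0]
    print(df.head())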

In Python, there are several libraries and methods that can be used for parsing and extracting data from text. These include the following:

  1. Regular expressions (regex or RE). This is a built-in library in Python that allows for pattern matching and extraction of data from strings. It uses a specific syntax to define patterns and rules for data extraction.
  2. Beautiful Soup. This is an external library that is mostly used for scraping and parsing HTML and XML code. It can be utilized to extract specific data from web pages or documents.
  3. Natural Language Toolkit (NLTK). This is a powerful library for natural language processing in Python. It provides various tools for tokenizing, parsing, and extracting data from text data. (Tokenizing is the process of breaking down a piece of text or string of characters into smaller units called tokens, which can be words, phrases, symbols, or individual characters.)
  4. TextBlob. This library provides a simple interface for common natural language processing tasks, such as sentiment analysis and part-of-speech tagging. It can also be utilized for parsing and extracting data from text.
  5. SpaCy. This is a popular open-source library for natural language processing. It provides efficient methods for tokenizing, parsing, and extracting data from text data.

Overall, the library or method used for parsing and extracting data will depend on the specific task and type of data being analyzed. It is important to research and determine the best approach for a given project.
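
As a brief illustration of the second option above, the following sketch downloads a page and extracts its title and links with Beautiful Soup; the URL is a placeholder.

Python Code

    import requests
    from bs4 import BeautifulSoup  # external package: beautifulsoup4

    # Placeholder URL; substitute the page to be scraped
    html = requests.get("https://example.com", timeout=10).text

    soup = BeautifulSoup(html, "html.parser")  # parse the HTML structure

    print(soup.title.string)         # text of the page's <title> element
    for link in soup.find_all("a"):  # every <a> (link) element
        print(link.get("href"), link.get_text(strip=True))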

Regular Expressions in Python

Regular expressions, also known as regex, are a set of symbols used to define a search pattern in text data. In Python, regular expressions are supported by the built-in re module, and their syntax is similar to that of other programming languages. Regular expressions offer researchers a robust method for identifying and manipulating patterns in text: specific words, characters, or patterns of characters can be searched for and matched. Typical applications include data parsing, input validation, and extracting targeted information from larger text sources. Common use cases in Python involve recognizing various types of data, such as dates, email addresses, phone numbers, and URLs, within extensive text files. Regular expressions are also valuable for tasks like data cleaning and text processing. They can be quite intricate, supporting advanced search patterns built from meta-characters like *, ?, and +. However, working with these expressions can present challenges, as they require a thorough understanding and careful debugging to ensure successful implementation.

Using Meta-Characters in Regular Expressions

  • The * character is known as the "star" or "asterisk" and is used to match zero or more occurrences of the preceding character or group in a regular expression. For example, the regular expression "a*" matches zero or more "a"s, so it matches the empty string as well as "a", "aa", "aaa", etc.
  • The ? character is known as the "question mark" and is used to indicate that the preceding character or group is optional. It matches either zero or one occurrences of the preceding character or group. For example, the regular expression "a?b" would match either "ab" or "b".
  • The + character is known as the "plus sign" and is used to match one or more occurrences of the preceding character or group. For example, the regular expression "a+b" would match one or more "a"s followed by a "b", such as "ab", "aab", "aaab", etc. If there are no "a"s, the match will fail. This is different from the * character, which would match zero or more "a"s followed by a "b", allowing for a possible match without any "a"s.
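
The behavior of these three meta-characters can be demonstrated with Python's re module (the sample string here is made up):

Python Code

    import re

    text = "b ab aab aaab"

    print(re.findall(r"a*b", text))  # ['b', 'ab', 'aab', 'aaab'] - zero or more "a"s
    print(re.findall(r"a?b", text))  # ['b', 'ab', 'ab', 'ab'] - zero or one "a"
    print(re.findall(r"a+b", text))  # ['ab', 'aab', 'aaab'] - one or more "a"s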

Example 2.5

Problem

Write Python code using regular expressions to search for a selected word “Python” in a given string and print the number of times it appears.
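
One possible solution is sketched below; the sample string is made up for illustration.

Python Code

    import re

    text = "Python is popular. Many data scientists use Python every day."

    # \b marks word boundaries so that only the whole word "Python" matches
    matches = re.findall(r"\bPython\b", text)

    print("The word 'Python' appears", len(matches), "times")

The resulting output will look like this:

The word 'Python' appears 2 times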

Parsing and Extracting Data

Splitting and slicing are two methods used to manipulate text strings in programming. Splitting a string means dividing a text string into smaller parts or substrings based on a specified separator. The separator can be a character, string, or regular expression. This can be useful for separating words, phrases, or data values within a larger string. For example, the string "Data Science" can be split into two substrings "Data" and "Science" by using a space as the separator.

Slicing a string refers to extracting a portion or section of a string based on a specified range of indices. An index refers to the position of a character in a string, starting from 0 for the first character. The range specifies the start and end indices for the slice; the resulting substring includes every character from the start index up to, but not including, the end index. For example, the string "Data Science" can be sliced to extract "Data" by specifying the range from index 0 to 4, which yields the first four characters (indices 0 through 3). Slicing can also be used to build modified strings, combining slices with new content to effectively replace, delete, or insert text at specific positions.
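
Both operations can be seen in a short sketch:

Python Code

    text = "Data Science"

    # Splitting on a space separator produces a list of substrings
    print(text.split(" "))  # ['Data', 'Science']

    # Slicing from index 0 up to (but not including) index 4
    print(text[0:4])        # 'Data'

    # Strings are immutable, so edits combine slices into a new string
    print(text[:5] + "Analytics")  # 'Data Analytics'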

Parsing and extracting data involves the analysis of a given dataset or string to extract specific pieces of information. This is accomplished using various techniques and functions, such as splitting and slicing strings, which allow for the structured retrieval of data. This process is particularly valuable when working with large and complex datasets, as it provides a more efficient means of locating desired data compared to traditional search methods. Note that parsing and extracting data differs from the use of regular expressions, as regular expressions serve as a specialized tool for pattern matching and text manipulation. In contrast, parsing and data extraction offers a comprehensive approach to identifying and extracting specific data within a dataset.

Parsing and extracting data using Python involves using the programming language to locate and extract specific information from a given text. This is achieved by utilizing the re library, which enables the use of regular expressions to identify and retrieve data based on defined patterns. This process can be demonstrated through an example of extracting data related to a person purchasing an iPhone at an Apple store.

The code in the following Python feature box uses regular expressions (regex) to match and extract specific data from a string. The string is a paragraph containing information about a person purchasing a new phone from the Apple store. The objective is to extract the product name, model, and price of the phone. The code starts by importing the necessary library for using regular expressions, and the string data is defined as a variable. Next, regex is used to search for specific patterns in the string. The first pattern searches for the words "product: " and captures everything after them up to the next comma; the result is stored in a variable named "product". Similarly, the second pattern looks for the words "model: " and captures everything after them up to the next comma; the result is saved in a variable named "model". Finally, the third pattern searches for the words "price: " and captures the dollar amount that follows (a "$" followed by digits); the result is saved in a variable named "price". After all the data is extracted, it is printed to the screen, using concatenation to add an appropriate label before each variable.

This application of Python code demonstrates the effective use of regex and the re library to parse and extract specific data from a given text. By using this method, the desired information can be easily located and retrieved for further analysis or use.

Python Code

    # Import the necessary library
    import re
    # Define data to be parsed
    data = "Samantha went to the Apple store to purchase a new phone. She was specifically looking for the latest and most expensive model available. As she looked at the different options, she came across the product: iPhone 12, the product name caught her attention, as it was the newest version on the market. She then noticed the model: A2172, which confirmed that this was indeed the latest and most expensive model she was looking for. The price made her hesitate for a moment, but she decided that it was worth it price: $799. She purchased the iPhone 12 and was excited to show off her new phone to her friends."
    
    # Use regex to match and extract data based on specific pattern
    product = re.search(r"product: (.+?),", data).group(1)
    model = re.search(r"model: (.+?),", data).group(1)
    price = re.search(r"price: (\$\d+)", data).group(1)  # a "$" followed by digits
    
    # Print the extracted data
    print("product: " + product)
    print("model: " + model)
    print("price: " + price)
  

The resulting output will look like this:

product: iPhone 12
model: A2172
price: $799

Processing and Storing Data

Once the data is collected, it should be processed and stored in a suitable format for further analysis. This is where data processing and storage come into play. Data processing manages the raw data first by cleaning it through the removal of irrelevant information and then by transforming it into a structured format. The cleaning process includes identifying and correcting any errors, inconsistencies, and missing values in a dataset and is essential for ensuring that the data is accurate, reliable, and usable for analysis or other purposes. Python is often utilized for data processing due to its flexibility and ease of use, and it offers a wide range of tools and libraries specifically designed for data processing. Once the data is processed, it needs to be stored for future use. (We will cover data storage in Data Cleaning and Preprocessing.) Python has several libraries that allow for efficient storage and manipulation of data in the form of DataFrames.
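
A minimal sketch of such cleaning with pandas, on made-up data, might look like this:

Python Code

    import pandas as pd

    # Made-up raw data with a duplicate row and a missing value
    df = pd.DataFrame({
        "model": ["A2172", "A2342", "A2172", "A2172"],
        "price": ["$799", "$899", None, "$799"],
    })

    df = df.drop_duplicates()  # remove exact duplicate rows
    df = df.dropna()           # drop rows with missing values

    # Strip the "$" and convert the price strings to numbers
    df["price"] = pd.to_numeric(df["price"].str.replace("$", "", regex=False))

    print(df)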

One method of storing data using Python is to create a pandas DataFrame and then use the to_csv() function to save the DataFrame as a CSV (comma-separated values) file. This file can then be easily opened and accessed for future analysis or visualization. For example, the code in the following Python feature box is a Python script that creates a dictionary with data about the presidents of the United States, including their number in order of service and their state of birth. A dictionary is a data structure that stores data in key-value pairs, allowing for efficient retrieval of a value using its key. The script then uses the built-in csv library to create a CSV file and write the data to it. This code stores the US presidents' data in a structured format for future use, analysis, or display.

Python Code

    import csv
    
    # Create a dictionary to store the data
    presidents = {
        "1": ["George Washington", "Virginia"],
        "2": ["John Adams", "Massachusetts"],
        "3": ["Thomas Jefferson", "Virginia"],
        "4": ["James Madison", "Virginia"],
        "5": ["James Monroe", "Virginia"],
        "6": ["John Quincy Adams", "Massachusetts"],
        "7": ["Andrew Jackson", "South Carolina"],
        "8": ["Martin Van Buren", "New York"],
        "9": ["William Henry Harrison", "Virginia"],
        "10": ["John Tyler", "Virginia"],
        "11": ["James K. Polk", "North Carolina"],
        "12": ["Zachary Taylor", "Virginia"],
        "13": ["Millard Fillmore", "New York"],
        "14": ["Franklin Pierce", "New Hampshire"],
        "15": ["James Buchanan", "Pennsylvania"],
        "16": ["Abraham Lincoln", "Kentucky"],
        "17": ["Andrew Johnson", "North Carolina"],
        "18": ["Ulysses S. Grant", "Ohio"],
        "19": ["Rutherford B. Hayes", "Ohio"],
        "20": ["James A. Garfield", "Ohio"],
        "21": ["Chester A. Arthur", "Vermont"],
        "22": ["Grover Cleveland", "New Jersey"],
        "23": ["Benjamin Harrison", "Ohio"],
        "24": ["Grover Cleveland", "New Jersey"],
        "25": ["William McKinley", "Ohio"],
        "26": ["Theodore Roosevelt", "New York"],
        "27": ["William Howard Taft", "Ohio"],
        "28": ["Woodrow Wilson", "Virginia"],
        "29": ["Warren G. Harding", "Ohio"],
        "30": ["Calvin Coolidge", "Vermont"],
        "31": ["Herbert Hoover", "Iowa"],
        "32": ["Franklin D. Roosevelt", "New York"],
        "33": ["Harry S. Truman", "Missouri"],
        "34": ["Dwight D. Eisenhower", "Texas"],
        "35": ["John F. Kennedy", "Massachusetts"],
        "36": ["Lyndon B. Johnson", "Texas"],
        "37": ["Richard Nixon", "California"],
        "38": ["Gerald Ford", "Nebraska"],
        "39": ["Jimmy Carter", "Georgia"],
        "40": ["Ronald Reagan", "Illinois"],
        "41": ["George H. W. Bush", "Massachusetts"],
        "42": ["Bill Clinton", "Arkansas"],
        "43": ["George W. Bush", "Connecticut"],
        "44": ["Barack Obama", "Hawaii"],
        "45": ["Donald Trump", "New York"],
        "46": ["Joe Biden", "Pennsylvania"]
    }
    
    # Open a new CSV file in write mode
    with open("presidents.csv", "w", newline='') as csv_file:
        # Specify the fieldnames for the columns
        fieldnames = ["Number", "Name", "State of Birth"]
        # Create a writer object
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        # Write the header row
        writer.writeheader()
        # Loop through the presidents dictionary
        for key, value in presidents.items():
            # Write the data for each president to the CSV file
            writer.writerow({
                "Number": key,
                "Name": value[0],
                "State of Birth": value[1]
                })
    # Print a success message
    print("Data successfully stored in CSV file.")
  

The resulting output will look like this:

Data successfully stored in CSV file.
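
As mentioned above, the same data could instead be stored with pandas and its to_csv() function. A minimal sketch, truncated to two presidents for brevity:

Python Code

    import pandas as pd

    # The same dictionary, truncated to two presidents for brevity
    presidents = {
        "1": ["George Washington", "Virginia"],
        "2": ["John Adams", "Massachusetts"],
    }

    rows = [(num, name, state) for num, (name, state) in presidents.items()]
    df = pd.DataFrame(rows, columns=["Number", "Name", "State of Birth"])

    df.to_csv("presidents.csv", index=False)  # index=False omits the row index
    print("Data successfully stored in CSV file.")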