Foundations of Information Systems

8.1 The Business Analytics Process


Learning Objectives

By the end of this section, you will be able to:

  • Define the terms associated with data analytics
  • Identify the importance and challenges of collecting and using big data
  • Describe the process of data acquisition
  • Explain the business analytics process

The process of data analytics involves examining datasets to draw conclusions and insights, typically using statistical and computational methods to inform decision-making or solve problems. It involves techniques and processes for exploring data, often with the aid of technology, to drive actionable intelligence. Analytics is a tool that enables organizations to derive competitive advantage by analyzing historical data, forecasting trends, and optimizing business processes.

The evolution of analytics is described as having three distinct eras:1

  • Analytics 1.0: focused on data warehouses and traditional business intelligence (historical reporting and descriptive analytics)
  • Analytics 2.0: the rise of big data with unstructured and high-velocity data, driven by new technologies like Apache Hadoop
  • Analytics 3.0: a modern era where businesses blend big data with traditional analytics to create data products that deliver real-time value

Big data allows organizations to gain a comprehensive understanding of their target market and customer base. For example, have you had the experience of searching for a particular item online, such as a new pair of shoes, and then noticed that your social media feed is inundated with ads for shoes and related items? This is the result of automated market research driven by data analytics. Organizations gather information about features such as customer demographics, preferences, purchase history, and online behavior. Using this information, analysts can identify patterns and trends. Then, leaders on the marketing team can tailor the organization’s products, services, and marketing campaigns to meet the specific demands of their customers, enhancing customer satisfaction and loyalty.

Importance of and Challenges with Big Data

Every generation presents a new disruptive technology that changes the face of business for those who recognize its potential. Innovations such as the cotton gin, textile mills, the steam engine, and the telegraph each revolutionized some aspect of the world and pushed technology forward. In the future, historians will add to this list the processing of big data, which is an extremely large set of structured and unstructured data that cannot be processed or analyzed using traditional database and software techniques. British mathematician Clive Humby is credited with stating, “Data is the new oil.”2 If he is right, then the analysts and companies that recognize the potential for insight will be the ones who “strike oil” in the business world. In that sense, the modern counterpart to oil extraction is data mining, or analyzing large datasets to discover patterns, trends, and insights using statistical and computational techniques. Companies like Google have led the way with marketing tools that capitalize on big data, helping organizations better understand consumer behavior.3

Challenge: Volume

The collection and use of big data have become increasingly important in today’s business landscape, yet harnessing the very real potential of big data comes with significant challenges. The sheer volume, velocity of production, and variety of data can overwhelm those who cling to traditional data management and analysis methods. Analysts report that by 2025 the global volume of digital information is expected to reach 200 zettabytes.4 Organizations must grapple with storing and processing enormous amounts of data. Designers and analysts need to work together to create and maintain scalable infrastructure capable of hosting advanced analytics tools.

Challenge: Quality

In addition to volume, the quality of big data poses challenges, as unstructured and noisy data can hinder accurate analysis and interpretation. This is a particular concern in situations where data analytics is key to success. Reliability issues stem from multiple causes, including inaccurate data, redundant entries, and simple human error during data entry.

Duplicated, or redundant, entries are multiple copies of the same record placed in a dataset by mistake. There are various ways to respond to them. The first and most obvious is to simply remove them: data engineers may use tools such as basic Python code and spreadsheet functions to filter out duplicate or corrupt records and produce a more accurate dataset, and input tools such as QR code scanners can help by automating data entry. A related data-quality technique is to assign a new value to an outlier (an observation that deviates significantly from the rest of the dataset), which may indicate an anomaly, an error, or a unique pattern that requires special attention during analysis. In this approach, you replace the outlier with a value that has a significantly lower impact on the dataset, such as the median or an average value.
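As a brief illustration, the following sketch uses pandas to drop duplicated rows and replace an outlier with the column median. The sales dataset and the interquartile-range rule used to flag the outlier are assumptions for illustration only, not part of the chapter’s case study.

# Minimal data-cleaning sketch with pandas (hypothetical "sales" data)
import pandas as pd

sales = pd.DataFrame({
    "order_id": [101, 102, 102, 103, 104],
    "amount": [25.0, 40.0, 40.0, 32.0, 9500.0],   # 9500.0 is an obvious outlier
})

# Remove redundant (duplicated) entries
sales = sales.drop_duplicates()

# Flag outliers using a common interquartile-range rule (one of several options)
q1, q3 = sales["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (sales["amount"] < q1 - 1.5 * iqr) | (sales["amount"] > q3 + 1.5 * iqr)

# Replace outliers with a lower-impact value, here the column median
sales.loc[is_outlier, "amount"] = sales["amount"].median()
print(sales)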

Challenge: Governance

Have you ever had your identity stolen? If not, you may know someone who has. These concerns relate to privacy and data governance, which is the overall management of the availability, usability, integrity, and security of data used in an enterprise. At the business level, companies do their best to comply with regulations and protect sensitive information. However, enforcement of strict digital privacy laws can vary from state to state and nation to nation. Companies that do business in Europe must also abide by Europe’s General Data Protection Regulation (GDPR). As you may recall from 6.1 Key Concepts in Data Privacy and Data Security, the GDPR is a leading global regulation in terms of enforcing transparency in how data are handled; it strictly forbids the purchase and sale of personally identifiable data and grants individuals the right to be forgotten. The GDPR is built upon several fundamental principles aimed at protecting the personal data of individuals within the European Union (EU). Refer to 6.4 Managing Enterprise Risk and Compliance to review these fundamental principles.

Challenge: Actionable Insights

The process of systematically using statistical and logical techniques to review, sort, and condense data for the purpose of gaining insight into areas of interest is called data analysis. Effective data analysis is critical for ensuring that the information is accurate so that an organization can then extract actionable insights from it. Extracting these insights is, however, a significant challenge. First, it requires special training to build skills and expertise in this area. Data scientists and analysts must possess a combination of statistical knowledge, programming skills, and domain expertise to navigate the complexities of big data analysis. Additionally, they must be able to comprehend the results and communicate them effectively to a broad audience. Incorrectly linking correlation to causation during data analysis can be an issue for both experts and software, and false positives and false negatives can lead conclusions astray. Additional challenges can arise from regulations in some regions, such as the EU, that restrict the collection and storage of certain personal data to protect privacy or that prohibit decrypting encrypted data.

Data Acquisition

With modern web analytics tools, companies analyze market trends and competitor activities in real time. By collecting and analyzing data from various sources—including social media, industry reports, customer reviews, and online forums—organizations can stay well-informed about market dynamics, emerging trends, and competitor strategies. Interested key decision-makers can then use this information to identify opportunities, anticipate market shifts, and proactively adapt their business strategies to maintain a competitive edge.

Analysts employ several methods to identify and acquire data from various sources: web scraping, sensor data collection, social media monitoring, data marketplaces and application programming interfaces (APIs), and internal data collection. Each offers a different window into the business environment; for example, social media monitoring reveals public sentiment and trends, while internal data sources provide valuable organizational insights. These methodologies form the cornerstone of modern data analysis practices.

Automated extraction of data from online sources, typically using software to simulate human browsing behavior and retrieve information from web pages, is called web scraping. Web scraping employs automated tools or scripts that gather relevant information from multiple web pages, including customer reviews, social media data, news articles, or publicly available datasets.
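As a simple sketch, the following Python snippet uses the widely available requests and Beautiful Soup libraries to collect review text from a web page. The URL and the HTML tag and class names are placeholders rather than a real site.

# Web-scraping sketch using requests and BeautifulSoup
# The URL and HTML structure are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/product-reviews"      # placeholder address
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Collect the text of every paragraph tagged with an assumed "review" class
reviews = [tag.get_text(strip=True) for tag in soup.find_all("p", class_="review")]
print(f"Collected {len(reviews)} reviews")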

With the proliferation of Internet of Things devices, analysts can use sensor data collection, which involves gathering data from sensors designed to detect and respond to physical or environmental conditions. These sensors generate real-time data on parameters such as temperature, humidity, pressure, location, or movement, providing valuable insights for industries such as manufacturing, health care, or logistics.

Social media monitoring involves tracking and collecting data from social media platforms to gain insight into customer sentiment, behavior, and trends. By analyzing social media conversations, comments, likes, and shares, analysts can identify emerging topics, consumer preferences, or even potential brand issues.

Some organizations provide data marketplaces or application programming interfaces. A data marketplace is an online platform or ecosystem where data providers and consumers can buy, sell, or exchange datasets and related services. These marketplaces make it easier to discover, purchase, and access data in various formats, often integrating tools for analytics, visualization, and compliance management. An application programming interface (API) is the means by which software applications communicate and interact with each other for the purpose of exchanging data and functionality. These platforms offer a range of data sources, including financial data, weather data, demographic data, and industry-specific datasets. For example, a search using the Google search engine can lead to ads on Facebook based on user data. When you search for a specific item, such as a new smartwatch, your query becomes a data point that may be gathered and shared with companies tagging the term “smartwatch,” prompting marketing tools on sites like Facebook and Instagram to serve customized smartwatch ads.
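To sketch what an API call might look like, the following snippet requests weather data from a hypothetical endpoint using the requests library. The URL, parameters, and key are assumptions for illustration only, not a real service.

# Sketch of pulling data from a hypothetical weather API
# The endpoint, parameters, and API key are placeholders.
import requests

response = requests.get(
    "https://api.example.com/v1/weather",            # hypothetical endpoint
    params={"city": "Houston", "units": "metric"},
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    timeout=10,
)
response.raise_for_status()       # raise an error if the request failed
weather = response.json()         # the JSON response becomes a Python dict
print(weather.get("temperature"))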

The final main methodology for data acquisition is collection from internal data sources. Organizations often have extensive internal data sources, including transaction records, customer databases, sales data, or operational logs. Analysts can tap into these sources to gather relevant data for analysis and gain insight into their own business operations. Gathering accurate data can become a challenge, however, if an internal source is adversely affected, such as when a natural disaster disrupts operations.

When collecting big data, analysts should also adhere to ethical considerations, follow data privacy regulations, and obtain proper permissions or consent when required. The importance of big data collection and use cannot be overstated. Organizations that can harness the power of big data gain a competitive edge by leveraging valuable insights for strategic decision-making. However, the challenges associated with big data, including its volume, quality, and the need for specialized skills, must be addressed effectively to unlock its full potential. By overcoming these challenges, businesses can capitalize on the immense value that big data offers and pave the way for innovation, growth, and success in the data-driven era.

The Business Analytics Process

The business analytics process consists of several stages, each one often influencing and informing the next. The process enables organizations to derive actionable insights from data (Figure 8.2).

Data/Business Analysis Process: 1. Problem definition. 2. Data preparation. 3. Statistical analysis. 4. Results interpretation. 5. Implementation.
Figure 8.2 The business analytics process begins with problem definition, paving the way for data preparation, analysis, interpretation, and implementation. Note that in some cases it may be necessary to repeat the cycle, as new problems may be identified along the way. (attribution: Copyright Rice University, OpenStax, under CC BY 4.0 license)

Step 1: Problem Definition

The first step in the process is problem definition. Here, the company sets out to name the problem or challenge that analytics will be used to solve. To illustrate this process, consider an organization studying botany that sets out to classify varieties of iris. By clearly defining the problem, organizational leaders can focus their data analytics efforts and ensure alignment with the organization’s goals. They can then begin to gather information to form the classifications.

Step 2: Data Preparation

Next comes collecting, cleaning, and transforming the data for analysis. This step includes gathering data from various sources, integrating separate datasets, and ensuring data quality. That can be a cumbersome task, since the organization will attempt to address issues such as missing values, outliers, or inconsistencies. Techniques for data cleaning and data transformation, such as data normalization or feature engineering, may be applied to ensure the data are suitable for analysis.

Data normalization involves adjusting data so that they have a standard scale, making it easier to compare different types of values. It ensures that one feature does not dominate others due to its scale. Dividing irises into categories is a relatively simple analysis and does not require data normalization. Other examples that would benefit from data normalization include comparing salary in thousands of dollars to years of experience, or comparing house prices and sizes. In the latter example, normalizing the size (by dividing all sizes by the largest size) can put that variable on a comparable scale to prices.
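A brief sketch of this idea in Python, using made-up house prices and sizes, might look like the following; dividing by the maximum is just one of several common normalization methods.

# Normalize house sizes by dividing by the largest size (illustrative numbers)
import pandas as pd

houses = pd.DataFrame({
    "price": [250000, 410000, 385000],
    "size_sqft": [1400, 2600, 2100],
})

# After normalization, size falls between 0 and 1, comparable in scale to other features
houses["size_normalized"] = houses["size_sqft"] / houses["size_sqft"].max()
print(houses)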

Feature engineering is transforming raw data into useful inputs for a model by creating, modifying, or selecting features (data points). It helps models understand patterns better by making relevant information more accessible. As an example, for predicting house prices, creating a new feature like “price per square foot” combines raw price and size into something more insightful.

As a simple use case, imagine predicting student test scores using hours studied and study material pages read. These features can be normalized so that the number of pages read does not overpower the number of hours studied, and a feature like efficiency (pages read per hour) can be engineered to capture how productive a student is. Effective data preparation is crucial for accurate and reliable results in subsequent stages of the analytics process.
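The following sketch applies both ideas to the hypothetical student data: it engineers an efficiency feature and rescales the raw features so neither dominates on scale alone. The numbers are invented for illustration.

# Feature engineering and normalization for hypothetical student data
import pandas as pd

students = pd.DataFrame({
    "hours_studied": [2.0, 5.0, 3.5],
    "pages_read": [40, 90, 70],
})

# Engineered feature: pages read per hour of study (efficiency)
students["efficiency"] = students["pages_read"] / students["hours_studied"]

# Normalize the raw features so pages_read does not overpower hours_studied
for col in ["hours_studied", "pages_read"]:
    students[col + "_norm"] = students[col] / students[col].max()
print(students)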

Data Acquisition

There are typically three methods of data acquisition: using built-in libraries, using external datasets, or manually entering data. Each approach has its own merits. Libraries can save time but may be incomplete if the data focus on some items that evolve over time, such as technology. External data are convenient, but large datasets may be challenging to work with, especially if there are multiple sources of data. Manually entering data could prove cumbersome, especially if time is an important factor.

Built-in Libraries

Many programming languages, like Python, offer libraries that include sample datasets useful for testing models. If you use Python for data analytics, you’ll find it equipped with powerful libraries of prewritten code and datasets tailored for various tasks, such as NumPy and Pandas. NumPy is useful for numerical calculations, while Pandas excels at manipulating and analyzing tabular data.

With these available libraries, Python becomes an ideal choice for scientific analysis and data-driven applications. Let’s use the classic public domain dataset for iris classification from R. A. Fisher5 for this example. The following snippet shows the code for importing the library and creating a pie chart. The example output is shown in Figure 8.3. Note that the line from sklearn import datasets instructs Python to use the sklearn library, which provides access to data on the iris species. The code also imports pandas to organize the data in a DataFrame and matplotlib to create the pie chart.

# Iris Species Data Study
# Import libraries
from sklearn import datasets
import pandas as pd
import matplotlib.pyplot as plt

# Load the Iris dataset
iris_data = datasets.load_iris()

# Create a DataFrame using the data and feature names
iris = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)

# Add the species column by mapping the target integer values to species names
iris['species'] = iris_data.target_names[iris_data.target]

# Plot a pie chart showing the distribution of species in the dataset
species_count = iris['species'].value_counts() # Count occurrences of each species
plt.figure(figsize=(7, 7))
plt.pie(species_count, labels=species_count.index, autopct='%1.1f%%', startangle=90)
plt.title('Distribution of Iris Species')
plt.show()
Pie chart representing Iris Species Proportions: 33.33% - Virginica; 33.33% - Versicolor; 33.33% - Setosa.
Figure 8.3 Python code was written to load and create a pie chart of various species of iris plants. (data source: R. A. Fisher, “The Use of Multiple Measurements in Taxonomic Problems,” Annals of Eugenics 7, no. 2 (September 1936): 179–88. https://doi.org/10.1111/j.1469-1809.1936.tb02137.x; attribution: Copyright Rice University, OpenStax, under CC BY 4.0 license)

You can observe how the code created a simple pie chart output to show the proportion of species of irises.

External Datasets

Using external datasets is the most common method of data collection. Here, the goal is to specify a path and a file name and then import the dataset from another location, which is often a spreadsheet or other standard data file type. The following Python code snippet accomplishes the same task as the previous example. The only difference is that it pulls the data from an external file instead of calling on Python’s self-contained libraries.

# Pull Data from an Excel Spreadsheet
import pandas as pd
import matplotlib.pyplot as plt

# Load the Excel file
df = pd.read_excel('C:/Users/daxbr/iris.xlsx')

# Count the occurrences of each species
x = df['Species'].value_counts()
# Get the labels (species names) from the value counts
theLabels = x.index.tolist()

# Plot the pie chart of species distribution
plt.pie(x, labels=theLabels, autopct='%1.1f%%', startangle=90)
plt.title('Distribution of Iris Species')
plt.show()

In this example, Python is instructed to access an Excel file and run analysis on the information contained in the file.

Manually Entered Data

With a small enough dataset, a third option is to manually enter the information. The drawbacks of manually entering data include the time involved in entering data for a large dataset and the possibility of introducing errors in the form of typos. The following code snippet produces an output similar to the previous two examples:

# Enter Your Own Data
import matplotlib.pyplot as plt

# Define the slice values, labels, and colors for each species
theSlices = [33.3, 33.3, 33.3]
theLabels = ["Virginica", "Versicolor", "Setosa"]
theColors = ['#96f97b', '#ff81c0', 'r']

# Plot the pie chart

plt.pie(theSlices, labels=theLabels, colors=theColors, autopct='%1.1f%%', startangle=90)
plt.title('Iris Species Distribution')
plt.show()

The choice to use internal libraries, external data, or manually entered data is made on a case-by-case basis. In practice, it is important to keep in mind that data acquisition may involve a combination of methods depending on where the source data are for a project. For example, in this process where the hypothetical organization conducts a botany study, it may be most appropriate to use the built-in library, since features of iris plants have not changed recently and are generally agreed on in the scientific community. Manually entering the data would be unnecessary.

Step 3: Statistical Analysis

Once the data are prepared, data analysts apply statistical analysis to uncover patterns, relationships, and insights. This stage involves using statistical methods, machine learning algorithms, or data mining techniques, or a combination of methods, to explore and analyze the data (Figure 8.4). Depending on the nature of the data, analysts may employ descriptive statistics, regression analysis, clustering, classification, or predictive modeling. Analysts use these tools to run simulations. This provides opportunities to observe potential costs, predict return on investment (ROI), and identify metrics. Descriptive statistics help to give a “snapshot” of data and provide a jumping-off point for analysis.

Statistical methods: Linear/Logistic regression, Time series/Cluster/Survival analysis, Decision trees, Analysis of variance, General linear models. Machine learning algorithms: Neural network, Deep learning, AI. Data mining techniques: Association rule/Text mining, Anomaly detection.
Figure 8.4 Some analysis methods can be categorized as statistical methods, machine learning algorithms, or data mining techniques, but many can fit into more than one category. (attribution: Copyright Rice University, OpenStax, under CC BY 4.0 license)
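For example, a descriptive snapshot of the iris measurements can be produced with a single pandas call; this sketch assumes the iris DataFrame built earlier in this section is still in memory.

# Descriptive "snapshot" of the iris measurements (uses the DataFrame from the earlier snippet)
print(iris.describe())                  # count, mean, std, min, quartiles, and max per column
print(iris['species'].value_counts())   # number of samples for each species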

Causality or Correlation?

Correlation does not imply causation, but does one attribute affect another? Returning to the iris data, the following simple command can explore the correlation of sepal length and petal length (Figure 8.5):

df.corr(numeric_only=True)   # correlate only the numeric measurement columns
Table with both columns and rows labeled: sepalLength, sepalWidth, PetalLength, PetalWidth and decimal numbers in cells. sepalLength/PetalLength is highlighted yellow (0.871754).
Figure 8.5 The simple Python command generates a comparison of the lengths and widths of petals and sepals and shows a positive correlation between the sepal length and petal length. (attribution: Copyright Rice University, OpenStax, under CC BY 4.0 license)

We can recognize that the petal length and the sepal length are strongly correlated; however, that correlation does not prove causality. It is a common pitfall for an analyst to believe that they have proven causation because of a strong correlation. The challenge in statistics is to remain objective and be cautious about using the word “proof.” For example, there is 100 percent correlation between eating chocolate and being born. Every person who eats chocolate has been born, but being born does not cause one to eat chocolate. In the iris case study, we showed correlation between sepal length and petal length only. There is no evidence from this data that either sepal length causes petal length or petal length causes sepal length.

Step 4: Results Interpretation

The next step is results interpretation, where the insights are translated into actionable information. Analysts evaluate the findings in the context of the problem at hand, interpret the statistical results, and draw conclusions. They often create visualizations, charts, or reports from the data to effectively communicate insights to stakeholders.
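As a simple illustration, the sketch below charts the average petal length for each species using the iris DataFrame built earlier in this section; it is one of many ways an analyst might visualize a finding for stakeholders.

# Communicate a finding visually: average petal length by iris species
# Assumes the iris DataFrame from the earlier snippet is still in memory.
import matplotlib.pyplot as plt

avg_petal = iris.groupby('species')['petal length (cm)'].mean()
avg_petal.plot(kind='bar', title='Average Petal Length by Species')
plt.ylabel('Petal length (cm)')
plt.tight_layout()
plt.show()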

Step 5: Implementation

The best data in the world are functionally useless without action. In the final phase, implementation, the organization puts the insights and recommendations into practice. This may involve strategic decision-making, process improvements, or operational changes based on the findings. Implementation may also require collaboration across departments or the integration of analytical models into existing systems or workflows.

Footnotes

  • 1. Thomas H. Davenport, “Analytics 3.0,” Harvard Business Review 91, no. 12 (December 2013): 64–72, https://hbr.org/2013/12/analytics-30
  • 2. Clive Humby, “Data Is the New Oil,” lecture at Association of National Advertisers conference, Orlando, FL, April 30–May 2, 2006.
  • 3. Christena Garduno, “How Big Data Is Helping Advertisers Solve Problems,” Forbes, March 15, 2022, https://www.forbes.com/sites/forbesagencycouncil/2022/03/15/how-big-data-is-helping-advertisers-solve-problems/
  • 4. Steve Morgan, “The World Will Store 200 Zettabytes of Data by 2025,” Cybersecurity Magazine, February 1, 2024, https://cybersecurityventures.com/the-world-will-store-200-zettabytes-of-data-by-2025/
  • 5. R. A. Fisher, “The Use of Multiple Measurements in Taxonomic Problems,” Annals of Eugenics 7, no. 2 (September 1936): 179–88, https://doi.org/10.1111/j.1469-1809.1936.tb02137.x