Web Scraping, Data Visualization

Web Scraping Using Python (I)

Simple web scraping and visualization, using insights to address dietary concerns

As the covid situation flared up once again in the city-state, we naturally cut down our time spent outdoors and tried to minimize time spent in crowded areas as far as possible. Nevertheless, the pantry requires periodic replenishment. The Household Overlord also remarked that online grocery delivery time slots are few and far between. So, if we were to make the trip to the supermarkets ourselves, we would want to be informed about the items available and make targeted acquisition trips.

Using skills picked up from the UpLevel Web Scraping masterclass, I suggested to the wife that we could perhaps explore and keep tabs on the items we would want to purchase. Intrigued, she agreed and provided some grocery categories to focus on: rice, noodles, and cooking ingredients.

This post covers some of the basic concepts learned from the UpLevel Web Scraping masterclass and applied to a personal project, as well as further research and findings along the way.

Photo by Markus Spiske on Unsplash

HTML is a markup language that uses a specific syntax to establish a hierarchical structure for a webpage’s content. Different elements mark up the content as a heading, paragraph, body, list, span, button, and so on. The opening and closing tags of an element typically take the form:

# example
<element_tag>text</element_tag>

Per the hierarchical structure, child elements are nested under a parent element like so (main is the parent, with the header and paragraph as child elements):

# example
<main>
<h1>Header</h1>
<p>Paragraph</p>
</main>

Before carrying out web scraping, it is prudent to study the web page to get a better sense of the relevant HTML elements pointing to the data we are looking for. Once we have identified those tags, we can proceed.

Process Overview

Web scraping would then be carried out through requests or Selenium, depending on the characteristics of the web page. With the scraped data, we can then use the identified HTML tags to extract the data, and clean/organize it into the structure and format we want.
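This is not the route ultimately taken for this project, but as a minimal sketch of that requests-plus-parsing flow on a static page (the URL, tag, and class names below are placeholders for illustration only):

# minimal sketch of the requests + BeautifulSoup flow for a static page
# (placeholder URL and tag/class names for illustration only)
import requests
from bs4 import BeautifulSoup

page_url = 'https://example.com/products'  # placeholder URL
res = requests.get(page_url)
# parse the HTML and pull out the elements identified during inspection
soup = BeautifulSoup(res.text, 'html.parser')
names = [tag.get_text(strip=True) for tag in soup.find_all('p', class_='product-name')]
print(names)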

Web Pages (Static & Dynamic)

Static web pages, as their name implies, remain the same until a navigation request is made by the user. When a user visits a static web page, the user’s web browser sends a request to the server, and the server sends the web page back in its response without doing any additional processing.

In the case of dynamic web pages, additional processing happens on the server side, and the rendered data varies (hence, the dynamism) depending on the request call. In this project’s case, the web page of interest is dynamic, as evidenced by the addition of items to the inventory page as one continues to scroll down. The grocery category of interest is ‘Rice, Noodles & Cooking Ingredients’.
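One common way to handle such scroll-to-load pages is to drive a browser with Selenium and scroll before grabbing the page source. Here is a minimal sketch with a placeholder URL (this is not the approach taken below, where the underlying JSON endpoint is used instead):

# minimal Selenium sketch for a scroll-to-load page (placeholder URL)
import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://example.com/category/rice-noodles-cooking-ingredients')  # placeholder
for _ in range(5):
    # scroll to the bottom so the next batch of items gets loaded
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)
page_source = driver.page_source  # HTML after additional items have loaded
driver.quit()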

Code Implementation at a Glance

For this particular webpage, BeautifulSoup is not necessary. Instead, the data is retrieved directly as JSON. The URL was derived from inspecting the network requests (right-click on the web page -> Inspect -> Network). Delving into the JSON keys, we begin to find information that cues us in on the data we seek.

# Import Libraries
import pandas as pd
import requests
# test_url is the JSON endpoint found by inspecting the network requests
# (the actual URL is omitted here)
# make a GET request
test_res = requests.get(test_url)
# get the JSON from the response
test_json = test_res.json()
# review the JSON keys
test_json.keys()
# returns dict_keys(['code', 'status', 'data'])
# review the keys nested under 'data'
test_json['data'].keys()
# returns dict_keys(['count', 'filters', 'pagination', 'product'])

While ‘product’ holds the relevant data we seek, ‘pagination’ also provides some useful information like so:

test_json['data']['pagination']
# returns a dictionary with the key: value pairs
{'page': 2, 'page_size': 20, 'total_pages': 149}

In particular, the page size indicates that the expected number of items per page is twenty, while the total number of pages is a hundred and forty-nine. These provide (1) a reference for a sanity check to ensure that the data returned per page corresponds to the correct number of items, and (2) a dynamic way to scrape the data should the number of pages vary. The product list from the JSON is then converted into a data frame using pandas.

# get the list of products, then use pd.json_normalize to turn it into a DataFrame
test_product = test_json['data']['product']
test_df = pd.json_normalize(test_product)
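
As a quick sanity check (point (1) above), the number of rows returned can be compared against the page size reported under ‘pagination’, bearing in mind that the last page may hold fewer items:

# sanity check: a full page should hold 'page_size' items (20 here)
expected = test_json['data']['pagination']['page_size']
assert len(test_df) == expected, f'expected {expected} items, got {len(test_df)}'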

From here on, the data is further reviewed and more processing is carried out. For this data, a particular column, ‘storeSpecificData’, is of interest. However, its data is nested within another dictionary, so a similar approach is used to extract the relevant data and organize it into another data frame.

test_df['storeSpecificData']
#returns
0 [{'currency': {'id': 106, 'name': 'SGD', 'symb...
1 [{'currency': {'id': 106, 'name': 'SGD', 'symb...
2 [{'currency': {'id': 106, 'name': 'SGD', 'symb...

Putting Everything Together

Once we have a good sense of how items from one page can be extracted and organized, we scale up the process. It is also here that we use test_json['data']['pagination']['total_pages'] to dynamically obtain the total number of pages.

# base_url and end_url are the fixed parts of the endpoint found earlier,
# with the page number slotted in between (actual values omitted)
# declare an empty list to contain all pages' worth of DataFrames
df_list = []
# loop from page 1 to the last page
print('Processing commences...')
for page_num in range(1, test_json['data']['pagination']['total_pages'] + 1):
    # construct the temp URL for this page
    temp_url = base_url + str(page_num) + end_url
    # make a GET request
    temp_res = requests.get(temp_url)
    # get the JSON from the response object
    temp_json = temp_res.json()
    # normalize the product list into a DataFrame
    temp_df = pd.json_normalize(temp_json['data']['product'])
    # append the DataFrame into the list of DataFrames
    df_list.append(temp_df)
# combine the DataFrames in the list into one large DataFrame
final_df = pd.concat(df_list)
# reset the index
final_df.reset_index(drop=True, inplace=True)
# get the entire DataFrame's "storeSpecificData" column
store_data = final_df['storeSpecificData']
# create an empty list to store the rows of data
store_list = []
# loop through each row to unpack the single-item list
for store in store_data:
    # normalize the first item in the list
    store_temp = pd.json_normalize(store[0])
    # append the DataFrame into the list
    store_list.append(store_temp)
# concat everything in the list into a single DataFrame
final_store_df = pd.concat(store_list)
# reset the DataFrame's index
final_store_df.reset_index(drop=True, inplace=True)

Both data frames are then combined.

# combine the DataFrame with the unpacked "storeSpecificData"
final_combined_df = pd.concat([final_df, final_store_df], axis = 1)

More Data Processing, Analysis & Visualization

Data is further processed to extract the dietary attribute labels. ‘Healthier Choices’ (i.e. reduced sugar content), ‘Trans Fat-Free’, and ‘Lactose-Free’ were selected as the primary labels to focus on.
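The exact field holding these labels depends on the scraped schema. As a rough sketch (the ‘tags’ column name and the label text are assumptions here, not the actual fields), Yes/No flag columns like the ones used further below could be derived with simple string matching on the cleaned DataFrame (called df_viz in the snippets below):

# rough sketch: derive Yes/No dietary-label flag columns by string matching
# ('tags' is a placeholder column name; the label text must match the actual data)
label_map = {'LowerSugar': 'Healthier Choices',
             'TransFatFree': 'Trans Fat-Free',
             'LactoseFree': 'Lactose-Free'}
for col, label in label_map.items():
    df_viz[col] = (df_viz['tags'].astype(str)
                   .str.contains(label, case=False)
                   .map({True: 'Yes', False: 'No'}))

After that, it’s data exploration. For example, the countries of origin for Rice, Noodles & Cooking Ingredients…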

Image by Author

Item Price by Country of Origin:

Image by Author | red-zone arbitrarily added to highlight higher-cost items

Higher-cost items:

Image by Author

Percentile of Items versus Retail Price:

Image by Author | 90% of items cost below 10sgd (dated 27 May data)
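
A curve like the one above can be drawn directly from the cleaned data. Here is a minimal matplotlib sketch, assuming df_viz carries a numeric retailPrice column:

# minimal sketch: percentile of items versus retail price
import numpy as np
import matplotlib.pyplot as plt

prices = df_viz['retailPrice'].dropna().sort_values().to_numpy()
percentiles = np.arange(1, len(prices) + 1) / len(prices) * 100

plt.plot(prices, percentiles)
plt.xlabel('Retail Price (SGD)')
plt.ylabel('Percentile of Items (%)')
plt.title('Percentile of Items versus Retail Price')
plt.show()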

The next pertinent topics closer to home were:

  • help the Household Overlord identify five pasta sauces that are lactose-free (within a budget of 50sgd)
  • advise Dad on the available brands of sugar-free sweeteners
df_lwrprice = df_viz.loc[df_viz['retailPrice'] < 100].copy()
df_sauces = df_lwrprice.loc[(df_lwrprice['LactoseFree'] == 'Yes') &
                            (df_lwrprice['Cat'] == 'Cooking Paste & Sauces')].sort_values('retailPrice').reset_index(drop=True)
df_sauces.iloc[-5:]

Image by Author | the last pasta was not lactose-free; a dietary disaster

df_hc = df_lwrprice.loc[(df_lwrprice['LowerSugar'] == 'Yes') &
                        (df_lwrprice['Cat'] == 'Sugar & Sweeteners')].sort_values('retailPrice').reset_index(drop=True)
df_hc.iloc[-5:]

Image by Author | In hindsight, cutting out sweeteners totally might be a wiser choice

Conclusion

So that’s it! Using the data, we also checked the availability and prices of other food items (e.g. noodles, pasta) before we made the essential trips to the supermarket.

The code and data can be found here. The data was scraped on 27 May 2021. The page would have been refreshed since then, so it should not be a surprise if the total number of pages returned differs the next time the code is run. There might also be subsequent changes to the web page structure, in which case the code will need to be modified. Stay safe, and have fun scraping and sifting data!

Note: The starter code for web scraping was based on the code-along session by UpLevel. Interested readers may check out their masterclasses and project catalogs for avenues to further improve their coding skills.

Data Science Enthusiast, Analyst. Sharing insights from own learning journey and pet projects in this space. Linkedin Profile: www.linkedin.com/in/ShengJunAng
