WEB SCRAPING, DATA VISUALIZATION, SELENIUM

Guide to web scraping using Python & Selenium

Photo by Ilya Pavlov on Unsplash

CONTENT

  • What is Selenium?
  • Why Selenium?
  • Project Pre-requisites
  • Website
  • Code Snippets
  • Data Visualization

What is Selenium?

Selenium is an open-source, web-based automation tool. Originally developed at ThoughtWorks as an in-house tool in 2004, it was eventually released as open source.

While Selenium is generally understood as a testing tool, those working on web scraping are likely to come across Selenium in the form of web drivers. So what gives? In the context of testing, Selenium is used to locate data on a website, thereby verifying whether a specific element is present or absent on the page. …
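The presence/absence check described above can be sketched roughly as follows. This is a minimal illustration, not the article's own code: the helper name `element_present`, the `run_demo` function, the URL, and the `form#login` selector are all hypothetical, and the demo assumes Selenium 4 with a local Chrome install.

```python
def element_present(driver, by, value):
    """Return True if at least one element on the current page matches the locator."""
    # find_elements returns an empty list (rather than raising) when nothing matches
    return len(driver.find_elements(by, value)) > 0


def run_demo(url="https://example.com"):
    """Hypothetical usage: open a page and report whether two elements exist."""
    # Imported here so the helper above can be exercised without a browser
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        print("h1 present:", element_present(driver, By.TAG_NAME, "h1"))
        print("login form present:",
              element_present(driver, By.CSS_SELECTOR, "form#login"))
    finally:
        # Always close the browser, even if the page fails to load
        driver.quit()
```

The same locator that confirms an element exists in a test is what a scraper would use to pull that element's text or attributes.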


DATA VISUALIZATION

A look at customizing options towards enhancing visualizations, and checking out some lesser-known plot functions along the way

Data visualization provides a visual context through maps or graphs. In doing so, it translates the data to a more natural form for the human mind to comprehend and pick out patterns or points of interest. A good visualization facilitates the conveyance of information or calls to action as part of the data presentation (storytelling).

Image by Author | Good for laughs, use at own risk for professional presentations.

The idea for the underlying project came partly from maintaining records on iaito orders some years back. I chose this data because I felt it could be a way to bridge data science (visualization and analytics) with traditional art (Iaido), and as a visualization…


Sorting out statistical tests & usage one at a time

Navigating statistical test methods for effective data analysis

Foreword

The main driver for this article is an effort to deepen my understanding of some common statistical methods and their application contexts, and to note them down for future reference. For each method, an example dataset is instantiated along with a code implementation, followed by an interpretation of the results.

Publication to Medium was semi-automated using the “jupyter to medium” library. The notebook is deployed as a draft, allowing for flexibility in fine-tuning. Depending on the purpose of the Jupyter notebook, mileage and productivity may vary. For example, the Pandas data frames are converted into…


Everyone’s running sprints, try a marathon

My journey through the data science course

Table of Contents

1. Introduction

2. Course Overview

3. The good

4. Areas of Improvement

5. Closing

Photo by Boitumelo Phetla on Unsplash

1. Introduction

Launched in late 2020, the IBM Data Science course is one of the few data science courses released by IBM to help graduates and working professionals seeking to pivot and break into data science. Other courses from IBM include the IBM Data Analyst Professional course and IBM Data Engineering Professional course.

Each course comprises several modules, with each module linked to a digital badge and certificate upon completion. Interested learners can opt to complete particular modules or all the modules to earn the full course certification.

2. Course Overview


Be deliberate in the problem-solving process

Addressing a business problem using a combination of structured problem solving, data analysis & machine learning

The culmination of Coursera’s IBM Data Science Professional Course is a capstone project that requires course participants to identify a business problem that calls for the use of location data and neighborhood clustering. The ability to analyze business problems, cut through the noise, and identify the actual issue to be addressed is an important skill to have and constantly hone. If the right questions are not identified, the effectiveness of the model results would be greatly diminished or rendered meaningless. …


Web Scraping, Data Visualization

Simple web scraping and visualization, using insights to address dietary concerns

As the COVID situation flared up once again in the city-state, we naturally cut down our time outdoors and tried to minimize time spent in crowded areas as far as possible. Nevertheless, the pantry requires periodic replenishment. Also, the Household Overlord has remarked that online grocery delivery time slots are few and far between. So, if we were to make the trip to the supermarket ourselves, we would want to be informed about the items available and make targeted acquisition trips.

Using skills picked up from the UpLevel Web Scraping MasterClass, I suggested to the wife that…


Metrics and approaches towards mitigating multicollinearity in linear regression models

Feature selection is a process where the predictor variables that contribute most significantly towards the prediction/classification of the target variable are selected. In feature selection for linear regression models, we are concerned with four aspects regarding the variables. Framed as the mnemonic “LINE”, these are:

  1. Linearity. The selected variable possesses a linear relationship with the target variable.
  2. Independence of predictor variables. The selected variables are independent of each other.
  3. Normality. Residuals generally follow a normal distribution (mean of zero).
  4. Equality of variance. The residual errors are generally consistent across the values of the predictor variables (i.e. homoscedasticity).
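As an illustration of checking the independence aspect, one commonly used metric is the variance inflation factor (VIF), computed for each predictor as 1 / (1 - R²), where R² comes from regressing that predictor on the others. The excerpt does not say which metrics the article uses, so treat the sketch below as an assumption; the function name and the synthetic data are hypothetical.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return the VIF of each column of X (rows = observations)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns, with an intercept term
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = float(residuals @ residuals)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r_squared))
    return vifs


# Synthetic example: x2 is nearly a copy of x1, x3 is unrelated to both
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)
x3 = rng.normal(size=500)
vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print([round(v, 1) for v in vifs])
```

In this setup the two near-duplicate predictors should show a large VIF while the unrelated one stays close to 1. Thresholds such as 5 or 10 are often quoted as rules of thumb for flagging a predictor, though the appropriate cut-off depends on the problem.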

In cases where…


Python tools to facilitate EDA

Photo by Markus Spiske on Unsplash

Data cleaning and Exploratory Data Analysis go hand-in-hand: only with a better understanding of the data can one be better positioned to spot errors or outliers for mitigation.

Most of us do EDA through pandas functions, coupled with visualizations using matplotlib or seaborn. Occasionally, we define functions to do 1) automated and 2) customized EDA of datasets (e.g. doing EDA on multiple, large datasets before merging them). While working on correlation matrices for a pet project, I came across Sweetviz & Pandas Profiling and incorporated them into my EDA workflow. Here are some of the observations:

Pandas Profiling

Compatibility & Installation: Weary…


An exploration of text classification using text feature and topic modeling, from concept to deployment

Word features from the Motive text. (Image by author)

One and a half years ago, I chanced upon Python and Anaconda as tools for Data Science (DS) while taking part in a Data Hackathon. The myriad Python libraries underscored Python’s versatility as a toolkit for data science. Sadly, I didn’t possess the necessary skillset to utilize Python back then. Since then, I have taken online courses in Python and SQL, and gradually developed an interest in the field of DS. Seeking to expedite the transition into DS, I enrolled in the Data Science Immersive (DSI) program by General Assembly. The DSI is a 3-month intensive boot camp…

ShengJun

Data Science Enthusiast, Analyst. Sharing insights from my own learning journey and pet projects in this space. LinkedIn Profile: www.linkedin.com/in/ShengJunAng
