1. Introduction to Web Scraping
In this lesson, we will introduce the concept of web scraping, and inspect the Chicago Point website to determine which information we would like to incorporate.
Overview of Web Scraping (00:04)
Web scraping is simply the term for downloading data from various web sources. Data obtained from web sources can be combined with internal data to generate insights. Data obtained through web scraping is often unstructured, requiring the use of regular expressions to convert the data to a more usable format.
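To illustrate what converting unstructured text with regular expressions can look like, here is a minimal sketch in Python rather than Alteryx. The sample lines and the field layout (dates, event, city, region) are made up for illustration; the actual calendar text may be laid out quite differently.

```python
import re

# Hypothetical sample lines; the real calendar text may be laid out differently.
raw_lines = [
    "Jan 12-14: Example Open, Springfield, Illinois",
    "Feb 3: Sample Cup, Austin, Texas",
]

# Named groups capture the date range, event name, city, and region.
pattern = re.compile(
    r"^(?P<dates>[A-Z][a-z]{2} \d{1,2}(?:-\d{1,2})?): "
    r"(?P<event>[^,]+), (?P<city>[^,]+), (?P<region>.+)$"
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = pattern.match(line)
    return match.groupdict() if match else None


records = [parse_line(line) for line in raw_lines]
for record in records:
    print(record)
```

Returning None for lines that do not match makes it easy to spot text the pattern fails to cover, which is common when the source layout changes.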
When downloading data from the web, you should always ensure you understand any relevant copyright laws applying to the data, and ensure you have any permissions required to use the data.
Course Structure (01:21)
In this course, we’ll download data from chicagopoint.com, a website for backgammon enthusiasts. The website maintains a calendar of backgammon tournaments and events. We aim to produce a report cross-referencing this data with geodata, to produce a map of backgammon tournament locations. This process involves six steps:
1. Inspect the chicagopoint.com website and identify relevant information
2. Connect Alteryx to the website and download relevant information
3. Parse the data using regular expressions
4. Use another website to convert non-standard characters to a format Alteryx understands
5. Determine geo-coordinates for tournament venues using the Google Maps API
6. Combine website and location data to produce the final report
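The geocoding step in the list above can be sketched outside Alteryx as well. The following Python example assumes the Google Maps Geocoding API's JSON endpoint and shows how a lookup URL might be built and how coordinates could be read from a response. The API key placeholder, the venue address, and the sample response values are all made up for illustration, and no network request is sent.

```python
import json
from urllib.parse import urlencode

# JSON endpoint of the Google Maps Geocoding API.
GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(address, api_key):
    """Build a geocoding request URL for a venue address."""
    query = urlencode({"address": address, "key": api_key})
    return f"{GEOCODE_ENDPOINT}?{query}"


def extract_coordinates(response_text):
    """Return (lat, lng) from a geocoding response, or None if there is no result."""
    payload = json.loads(response_text)
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


# Placeholder key and address (assumptions, not values from the course).
url = build_geocode_url("Chicago, Illinois", "YOUR_API_KEY")

# Made-up response trimmed to the fields used here; coordinates are not real data.
sample_response = json.dumps(
    {
        "status": "OK",
        "results": [{"geometry": {"location": {"lat": 41.0, "lng": -87.0}}}],
    }
)

coordinates = extract_coordinates(sample_response)
print(url)
print(coordinates)
```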
Inspecting the Website (02:26)
We first want to inspect the website to find the relevant data. This page contains a calendar of upcoming backgammon events. In the Google Chrome browser, we can view the code for the webpage by pressing Ctrl + Shift + I. We can also search this code by pressing Ctrl + F, and searching for the text of interest. This allows us to match elements on the webpage with their corresponding code.
On a webpage, the div tag signifies a block of website content, like an image or a paragraph of text. In our case, we find that the code <div id="calendar"> is used to indicate the calendar of upcoming events. This is the section of the webpage that contains the data of interest to us.
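As a rough illustration of how a block identified by <div id="calendar"> could be isolated programmatically, the sketch below uses Python's standard html.parser module. It is not the Alteryx workflow used in this course, and the sample HTML is a made-up stand-in for the real page, whose markup may differ.

```python
from html.parser import HTMLParser


class CalendarExtractor(HTMLParser):
    """Collect the text inside <div id="calendar">, including nested tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # div nesting depth inside the calendar block
        self.chunks = []  # text fragments found inside the block

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1  # nested div inside the calendar
        elif dict(attrs).get("id") == "calendar":
            self.depth = 1   # entering the calendar block

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_calendar_text(html):
    """Return the list of text fragments found in the calendar div."""
    parser = CalendarExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Made-up sample page; only the id="calendar" attribute comes from the lesson.
sample_html = """
<html><body>
<div id="header">Site header</div>
<div id="calendar">
  <p>Example Open</p>
  <div class="row">Sample Cup</div>
</div>
<div id="footer">Site footer</div>
</body></html>
"""

calendar_text = extract_calendar_text(sample_html)
print(calendar_text)
```

Tracking the nesting depth means that div elements inside the calendar do not end the capture early; only the closing tag that matches the calendar div does.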
We all know that the internet can provide vast pools of timely data as long as you have a trusted source. Cross-referencing this information with your internal data can facilitate new insights.
In this course, we're going to consider how Alteryx can be used to combine information from different websites to yield a more comprehensive result.
We're going to do this through a process called web scraping.
Web scraping is simply the term for downloading data from various web sources.
Note that the data obtained through web scraping is often unstructured, so it requires the use of regular expressions to get it into a usable format. We'll lean pretty heavily on regular expressions in this course, so please review the RegEx lessons in the data manipulation course if you need a refresher.
Before we proceed any further, a quick note on the matter of copyright.
While much of the information on the web may be freely available, this data often has rights attached.
In many cases, you may be entitled to use this information for personal purposes, but not commercially, at least not without permission. Please ensure that you have a full understanding of the relevant copyright laws before using any of the tools or techniques that we discuss in this course.
Throughout this course, we're going to reference chicagopoint.com, a website for backgammon enthusiasts. Backgammon is an ancient board game played all over the world and chicagopoint.com maintains a calendar of tournaments and events.
Our goal in this course is to produce a report that cross-references this tournament information with geodata and displays each venue on a map.
Here's how we'll accomplish this goal.
First, we'll inspect the website and determine the relevant information for our report. Next, we'll connect Alteryx to the website and download that information.
After that, we'll parse the data, making heavy use of the RegEx tool.
We'll then connect to another website to convert any non-standard characters to a format that Alteryx easily understands.
At this point, we'll cross-reference the venue data with the Google Maps API to pull down geo-coordinates.
As a final step, we'll combine the website and location data to produce our final report.
Let's address the first step and inspect the Chicago Point website.
We'll start by navigating to www.chicagopoint.com/calendar.html.
We can see that this webpage contains various schedule details. The bit that we're interested in is the calendar of events at the bottom.
We want to instruct Alteryx to visit this webpage and return the information contained in the table of dates and events. If we enter the keyboard shortcut control-shift-I, we can view the code behind different elements of the webpage. As we move our mouse over different sections of the text, different parts of the webpage are highlighted.
With longer webpages, it can be tricky to find the exact section you're looking for, and so a convenient shortcut is to press control-F and bring up the search bar.
As you become more familiar with website code, you can go directly to the source code by using the shortcut control-U.
Note that the keyboard shortcuts for other browsers are similar.
Let's go back to the calendar.
We can see that the first event listed is Prime Time Chicago.
Note that the event listing may look a bit different if the schedule has been recently updated.
We'll search for this event now.
This search returns several matches.
We're interested in the section that specifies the calendar. This will be marked by the HTML div tag.
The div tag signifies an element or division.
It's a generic container for a block of website content such as an image or paragraph of text.
This line of code marks the section we're interested in. We'll take note of this code now and stop the lesson here.
In the next lesson, we'll use Alteryx to download the relevant event data.