1. Introduction to Web Scraping
In this lesson, we will introduce the concept of web scraping, and inspect the Chicago Point website to determine which information we would like to incorporate.
The process of downloading data from various web sources. Note that this data is often unstructured and may be subject to copyright laws and various use restrictions.
Cross reference Chicago Point tournament information with geodata and create a report that displays each tournament venue on a map.
- Inspect the website
- Connect Alteryx
- Parse data with RegEx
- Convert non-standard characters
- Determine venue geo-coordinates
- Combine website and location data into a final report
The goal of this lesson is to inspect the website to determine the relevant information for the report.
Ctrl + Shift + I
This shortcut allows users to view the code behind different elements of the webpage via the inspection pane.
Ctrl + F
This shortcut allows users to search the inspection pane for specific text. Note that this shortcut generally brings up the search bar in most Microsoft Windows programs.
Ctrl + U
This shortcut brings up the website source code in a new tab.
HTML “div” tag
The div tag is a generic container for a block of website content such as an image or paragraph of text. In this case, we are looking for the code <div id="calendar">, which marks the section of the website that contains the calendar of events.
We all know that the internet can provide vast pools of timely data as long as you have a trusted source. Cross-referencing this information with your internal data can facilitate new insights.
In this course, we're going to consider how Alteryx can be used to combine information from different websites to yield a more comprehensive result. We're going to do this through a process called web scraping.
Web scraping is simply the term for downloading data from various web sources.
Note that the data obtained through web scraping is often unstructured, so it requires the use of regular expressions to get it into a usable format. We'll lean pretty heavily on regular expressions in this course, so please review the RegEx lessons in the data manipulation course if you need a refresher. Before we proceed any further, a quick note on the matter of copyright.
While much of the information on the web may be freely available, this data often has rights attached. In many cases, you may be entitled to use this information for personal purposes, but not commercially, at least not without permission. Please ensure that you have a full understanding of the relevant copyright laws before using any of the tools or techniques that we discuss in this course.
Throughout this course, we're going to reference chicagopoint.com, a website for backgammon enthusiasts. Backgammon is an ancient board game played all over the world, and chicagopoint.com maintains a calendar of tournaments and events.
Our goal in this course is to produce a report that cross-references this tournament information with geodata, and displays each venue on a map.
Here's how we'll accomplish this goal. First, we'll inspect the website and determine the relevant information for our report. Next, we'll connect Alteryx to the website and download that information. After that, we'll parse the data, making heavy use of the RegEx tool.
We'll then connect to another website to convert any non-standard characters to a format that Alteryx easily understands.
At this point, we'll cross-reference the venue data with the Google Maps API to pull down geocoordinates.
As a final step, we'll combine the website and location data to produce our final report.
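As a rough illustration of the parse and convert steps outside Alteryx, the Python sketch below applies a regular expression to a small, made-up HTML fragment and then converts HTML character entities to plain characters. The fragment, its row layout, the event and venue names, and the function name parse_events are all assumptions for illustration. The real Chicago Point markup will differ, and the course itself performs these steps with Alteryx tools.

```python
import html
import re

# Hypothetical sample markup (an assumption); the real page's structure will differ.
SAMPLE_HTML = """
<div id="calendar">
<p><b>Jan 5</b> Example Monthly Tournament; Caf&eacute; Example, Springfield</p>
<p><b>Feb 9</b> Sample Winter Open; Example Hall, Z&uuml;rich</p>
</div>
"""

# Assumed layout: one event per <p>, with a bold date, then the event name,
# a semicolon, and the venue.
EVENT_PATTERN = re.compile(
    r"<p><b>(?P<date>[^<]+)</b>\s*(?P<event>[^;<]+);\s*(?P<venue>[^<]+)</p>"
)


def parse_events(markup):
    """Return one dict per matched event, with HTML entities converted."""
    events = []
    for match in EVENT_PATTERN.finditer(markup):
        # html.unescape turns entities such as &eacute; into plain characters.
        events.append(
            {key: html.unescape(value.strip())
             for key, value in match.groupdict().items()}
        )
    return events


for event in parse_events(SAMPLE_HTML):
    print(event["date"], "|", event["event"], "|", event["venue"])
```

A pattern this strict only works when every row follows the same layout, which is why inspecting the page first matters.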
Let's address the first step and inspect the Chicago Point website. We'll start by navigating to www.chicagopoint.com/calendar.html.
We can see that this web page contains various schedule details. The bit that we're interested in is the calendar of events at the bottom.
We want to instruct Alteryx to visit this webpage and return the information contained in the table of dates and events. If we press the keyboard shortcut control-shift-i, we can view the code behind different elements of the webpage. As we move our mouse over different sections of the code, different parts of the webpage are highlighted.
With longer webpages it can be tricky to find the exact section you're looking for, and so a convenient shortcut is to press control-f and bring up the search bar.
As you become more familiar with website code, you can go directly to the source code by using the shortcut control-u.
Note that the keyboard shortcuts for other browsers are similar.
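The page source that control-u displays can also be retrieved programmatically. The sketch below uses Python's standard urllib library. The function name fetch_source, the timeout value, and the http:// scheme in the usage comment are assumptions, and in this course the download itself is done with Alteryx.

```python
from urllib.request import urlopen


def fetch_source(url, timeout=10):
    """Return the page source at `url` as text."""
    with urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Example usage (requires network access; the scheme is an assumption):
# source = fetch_source("http://www.chicagopoint.com/calendar.html")
# print(source[:500])
```

Please keep the earlier copyright note in mind before downloading content from any website this way.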
Let's go back to the calendar.
We can see that the first event listed is the Backgammon by the Bay monthly. Note that the event listing may look a bit different if the schedule has been recently updated.
We'll search for this event now.
This search returns several matches.
We're interested in the section that specifies the calendar. This will be marked by the HTML div tag.
The div tag signifies a division, or section, of the page.
It's a generic container for a block of website content, such as an image or a paragraph of text.
The line of code <div id="calendar"> marks the section we're interested in. We'll take note of this code now and stop the lesson here. In the next lesson, we'll use Alteryx to download the relevant event data.
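To show why the id on the div tag is useful, here is a minimal Python sketch that picks out the text inside the div whose id is calendar and ignores the rest of the page. It uses the standard library's html.parser module. The sample page, the class name DivTextExtractor, and the helper extract_div_text are hypothetical and are not part of the course's Alteryx workflow.

```python
from html.parser import HTMLParser

# Hypothetical page (an assumption); the real site's markup will differ.
SAMPLE_PAGE = """
<div id="header">Schedule details</div>
<div id="calendar">
  <p>Example Monthly Tournament</p>
  <div class="note">Dates may change</div>
  <p>Sample Winter Open</p>
</div>
<div id="footer">Contact</div>
"""


class DivTextExtractor(HTMLParser):
    """Collect the text found inside the <div> with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0    # div nesting depth inside the target; 0 means outside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1    # a nested div inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1     # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_div_text(markup, div_id):
    """Return the non-empty text chunks inside <div id=div_id>."""
    parser = DivTextExtractor(div_id)
    parser.feed(markup)
    parser.close()
    return parser.chunks


print(extract_div_text(SAMPLE_PAGE, "calendar"))
```

Tracking the nesting depth keeps the extractor inside the calendar section even when that section contains further div tags of its own.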