1. Introduction to Web Scraping
In this lesson, we introduce the concept of web scraping and inspect the Chicago Point website to determine which information we want to extract.
Overview of Web Scraping (00:04)
Web scraping is the term for downloading data from web sources, typically web pages. Data obtained this way can be combined with internal data to generate insights. Scraped data is often unstructured, so it usually needs to be parsed, for example with regular expressions, to convert it to a more usable format.
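To make the parsing step concrete, here is a minimal Python sketch of using a regular expression to turn one line of unstructured text into named fields. The sample line, the field names, and the `parse_event` helper are all hypothetical; the course itself does this work in Alteryx, and the real page's layout may differ.

```python
import re

# Hypothetical scraped line; the real calendar text may be laid out differently.
raw = "Jan 12-14: Example Open; Example Hotel, Springfield, Illinois"

# Named groups capture the dates, event name, venue, city, and state.
pattern = re.compile(
    r"(?P<dates>[A-Z][a-z]{2} \d{1,2}(?:-\d{1,2})?): "
    r"(?P<event>[^;]+); "
    r"(?P<venue>[^,]+), "
    r"(?P<city>[^,]+), "
    r"(?P<state>.+)"
)

def parse_event(line):
    """Return a dict of fields, or None if the line does not match."""
    match = pattern.match(line)
    return match.groupdict() if match else None

record = parse_event(raw)
print(record)
```

Returning `None` for lines that do not match makes it easy to filter out rows that need manual review rather than silently producing bad records.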
When downloading data from the web, you should always ensure you understand any relevant copyright laws applying to the data, and ensure you have any permissions required to use the data.
Course Structure (01:21)
In this course, we’ll download data from chicagopoint.com, a website for backgammon enthusiasts. The website maintains a calendar of backgammon tournaments and events. We aim to produce a report cross-referencing this data with geodata, to produce a map of backgammon tournament locations. This process involves six steps:
1. Inspect the chicagopoint.com website and identify relevant information
2. Connect Alteryx to the website and download relevant information
3. Parse the data using regular expressions
4. Use another website to convert non-standard characters to a format Alteryx understands
5. Determine geo-coordinates for tournament venues using the Google Maps API
6. Combine website and location data to produce the final report
Inspecting the Website (02:26)
We first inspect the website to find the relevant data. The page contains a calendar of upcoming backgammon events. In the Google Chrome browser, we can view the HTML code for the webpage by pressing Ctrl + Shift + I, which opens the developer tools. We can then search this code by pressing Ctrl + F and entering the text of interest. This allows us to match elements on the webpage with their corresponding code.
In HTML, a div element is a generic container that groups a block of page content, such as an image or a paragraph of text. In our case, we find that the code <div id="calendar"> marks the calendar of upcoming events. This is the section of the webpage that contains the data of interest to us.
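As a rough illustration of isolating that block outside the browser, the following Python sketch collects the text inside a div whose id is calendar, using only the standard library. The `CalendarExtractor` class and the sample markup are hypothetical stand-ins; the actual page structure is an assumption here, and the course performs this step in Alteryx.

```python
from html.parser import HTMLParser

class CalendarExtractor(HTMLParser):
    """Collect text found inside <div id="calendar">, including nested divs."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # div nesting depth inside the target; 0 means outside
        self.chunks = []  # non-empty text fragments found inside the target

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1  # a nested div inside the calendar
        elif dict(attrs).get("id") == "calendar":
            self.depth = 1   # entering the calendar div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())

# Hypothetical markup standing in for the downloaded page.
sample_html = """
<html><body>
<div id="header">Site header</div>
<div id="calendar">
  <p>Jan 12-14: Example Open</p>
  <div class="venue">Example Hotel, Springfield</div>
</div>
<div id="footer">Site footer</div>
</body></html>
"""

parser = CalendarExtractor()
parser.feed(sample_html)
calendar_text = parser.chunks
print(calendar_text)
```

Tracking nesting depth, rather than a simple on/off flag, keeps the extractor inside the calendar block until its own closing tag is reached, even when the block contains further div elements.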