1. Introduction to Web Scraping

 

Web Scraping

9 lessons, 0 exercises


Overview

In this lesson, we introduce the concept of web scraping and inspect the Chicago Point website to determine which information we want to extract.

Summary

  1. Overview of Web Scraping (00:04)

    Web scraping is the term for downloading data from web sources. Data obtained this way can be combined with internal data to generate insights. Scraped data is often unstructured, so it usually needs to be parsed, for example with regular expressions, into a more usable format.
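
    As a rough illustration of how a regular expression can turn unstructured text into fields, here is a minimal Python sketch. The course itself does this step in Alteryx; the sample string and the pattern below are hypothetical and are not taken from the actual website.

```python
import re

# Hypothetical line of scraped text; the real page's layout may differ.
raw = "Mar 14-16: Midwest Open, Chicago, Illinois"

# Named groups pull the date range, event name, and location out of the string.
pattern = re.compile(
    r"(?P<month>[A-Z][a-z]{2}) (?P<start>\d{1,2})-(?P<end>\d{1,2}): "
    r"(?P<event>[^,]+), (?P<city>[^,]+), (?P<state>.+)"
)

match = pattern.match(raw)
# Fall back to an empty record when the text does not fit the pattern.
record = match.groupdict() if match else {}
print(record)
```

    Each named group becomes a key in the resulting dictionary, which is roughly what parsing a text field into separate columns achieves.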

    When downloading data from the web, always make sure you understand any copyright laws that apply to the data and that you have any permissions required to use it.

  2. Course Structure (01:21)

    In this course, we’ll download data from chicagopoint.com, a website for backgammon enthusiasts. The website maintains a calendar of backgammon tournaments and events. We aim to cross-reference this data with geodata to produce a report that maps backgammon tournament locations. This process involves six steps:

    1. Inspect the chicagopoint.com website and identify relevant information

    2. Connect Alteryx to the website and download relevant information

    3. Parse the data using regular expressions

    4. Use another website to convert non-standard characters to a format Alteryx understands

    5. Determine geo-coordinates for tournament venues using the Google Maps API

    6. Combine website and location data to produce the final report
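
    To make step 5 more concrete, the sketch below builds a request URL for a venue lookup. It assumes the Google Maps Geocoding web service is the API in question, and it only constructs the URL without sending a request. The function name `build_geocode_url`, the sample address, and the placeholder key are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the Geocoding web service endpoint of the Google Maps Platform.
GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(address: str, api_key: str) -> str:
    """Return a geocoding request URL with the address safely URL-encoded."""
    query = urlencode({"address": address, "key": api_key})
    return f"{GEOCODE_ENDPOINT}?{query}"


# Hypothetical venue and placeholder key, for illustration only.
url = build_geocode_url("Chicago, Illinois", "YOUR_API_KEY")
print(url)
```

    Encoding the address matters because venue names often contain spaces, commas, and other characters that are not valid in a raw URL.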

  3. Inspecting the Website (02:26)

    We first inspect the website to find the relevant data. This page contains a calendar of upcoming backgammon events. In the Google Chrome browser, we can view the HTML code for the webpage by pressing Ctrl + Shift + I, which opens the developer tools. We can then search this code by pressing Ctrl + F and entering the text of interest. This allows us to match elements on the webpage with their corresponding code.

    On a webpage, a div element is a generic container for a block of content, such as a group of images or paragraphs of text. In our case, we find that the code <div id="calendar"> marks the calendar of upcoming events. This is the section of the webpage that contains the data of interest to us.
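
    For readers who want to see the same idea in code, the following Python sketch pulls the text out of a <div id="calendar"> block using only the standard library. The sample HTML is hypothetical and much simpler than the real page, and the class name `CalendarExtractor` is made up for this example; the course itself performs the extraction in Alteryx.

```python
from html.parser import HTMLParser


class CalendarExtractor(HTMLParser):
    """Collects the text found inside <div id="calendar">."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # div nesting depth inside the calendar; 0 means outside
        self.chunks = []  # text fragments found inside the calendar

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1  # a nested div inside the calendar
        elif dict(attrs).get("id") == "calendar":
            self.depth = 1   # entering the calendar block

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1  # leaving a nested div or the calendar itself

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


# Hypothetical page fragment; the real site's markup may differ.
sample_html = """
<div id="header">Chicago Point</div>
<div id="calendar">
  <div class="event">Midwest Open</div>
  <div class="event">Spring Classic</div>
</div>
<div id="footer">Contact</div>
"""

parser = CalendarExtractor()
parser.feed(sample_html)
parser.close()
print(parser.chunks)
```

    Tracking the nesting depth keeps the extractor inside the calendar block even when it contains further div elements, and stops it from collecting text from the header or footer.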
