Subtitles Enabled

Sign up for a free trial to access more free content.

Free trial

Overview

We’ll continue where we left off in the previous lesson and run through how to use the RegEx tool to clean the backgammon tournament data.

Summary

Lesson Goal

The goal of this lesson is to go through the remaining steps required to parse and clean the data.

Key Steps

  1. Separate out pieces of tournament data
  2. Organize date information

Step 1: Separate out pieces of tournament data

Several actions are required to accomplish this step:

  • Use a RegEx tool to parse the Output field for the date information using the code ([ADFJMNOS][a-z]{2}\s?\d\d?([\-]\d\d?)?)(?=(\.)+)
  • Use another RegEx tool to parse the Output field for the date information using the code ((https?:\/\/((w{3})\.)?))([A-Za-z+-_.-\/]+)
  • Use a Formula tool to consolidate the website URL into a single column using a simple addition
  • Use a Select tool to remove unnecessary fields and rename RegExOut1 to Date
  • Use a third RegEx tool to parse the Output field for the tournament name information using the code ">([A-Za-z0-9\s\/\(-\üö#&;:<>')]{4,})<\/a
  • Use a fourth RegEx tool to parse the Output field for the venue information using the code (>,\s?)(.+(<a|\.\.|\.\d))
  • Use a fifth RegEx tool to parse the RegExOut2 field for the venue information using the code (.+\b)\.
  • Use a Select tool to remove unnecessary fields, rename RegExOut1 to Tournament, and RegExOut1_3 to Venue.

 

Step 2: Organize date information

Several actions are required to accomplish this step:

  • Use a Text to Column tool to split out the end date of each tournament (if applicable)
  • Use another Text to Column tool to split out the month and start date
  • Use a Select tool to remove unnecessary fields and rename Date2 as Day (end), Date11 as Month, and Date12 as Day (start)
  • Use a Formula tool to convert the date fields into a format Alteryx can read and clean up the venue information. Specifically:
    • Use a Switch expression to return the number associated with each month
    • Use both a DateTimeParse and ToString expression to combine the start day, month, and year information, and convert the value to a string
    • Use an If statement, a DateTimeParse expression, and a ToString expression to combine the end day, month, and year information, and convert the value to a string
    • Use a REGEX_Replace expression to replace ampersands with the word “and” in the Venue field
    • Use another REGEX_Replace expression to replace “SWIZ” with the word “Switzerland” in the Venue field
    • Use a third REGEX_Replace expression to replace “AUSL” with the word “Australia” in the Venue field
    • Copy the information in the Venue field into a new field called Address
  • Use a Data Cleansing tool to remove leading or trailing white space
  • Use a Select tool to bring forward only necessary fields

About Regular Expressions

You can learn more about how to build Regular Expressions from this site.

Transcript

In the previous lesson, we started the process of using the RegEx tool to tidy up our data for further analysis.

Our goal in this lesson is to go through the remaining steps required to parse and clean the data.

These steps are quite repetitive, so instead of adding the tools in real time, we'll just run through the completed workstream.

In order to finish off the parsing and cleaning process, I went through two key steps.

First, I separated out the various pieces of tournament data.

Next, I organized the date information. If we look at where we stopped in the last lesson, we can see that we have two columns, Output and Year.

If we click on the Output field for row 1 and then open the Cell Viewer, we can see that this cell contains four key pieces of information, that is, date which consists of month and day, website address, tournament name, and location.

We'll move on to the first step and see how to separate the tournament data.

I started with the date details.

As we can see, I used a RegEx tool that targeted the Output field and entered the rather complex expression you see here.

This expression looks for a three-letter word beginning with a capital letter from this group, followed by two small letters.

These capital letters are the possible first letters for every month.

This expression then looks for a space, followed by one or two numbers after the month.

There's then an additional choice of one or two more numbers at the end.

As we can see, this tool resulted in three columns where RegExOut1 contained the relevant date information.

I then connected another RegEx tool to parse out the website information.

Again, I targeted the Output field, but this time I entered a different expression. This looks for text, like https followed by www dot.

However, notice the question marks.

These mean that those expressions may or may not be present.

The code then looks for any text from the character set A to Z, both upper and lower case, as well as underscores, periods, dashes and forward slashes.

This tool resulted in five outputs, however, the entire website address is presented in the combination of columns RegExOut1_2 and RegExOut5.

We want the complete website, so I connected a Formula tool, created a new formula called Website, and simply combined the two fields.

I then cleaned up the workstream by connecting a Select tool, renaming RegExOut1 to Date, and bringing forward only the Output, Year, Website and new Date fields. Moving on, we needed to get the tournament name information. If we click on Output, row 1 and then Cell Viewer, we can see that the tournament name usually begins with an angled bracket and closes with a slash a, surrounded by angled brackets.

This is helpful information for the next RegEx tool.

The code I entered here looks for a wide character set appearing between the two bookends, but there must be at least four characters.

After this, I connected two RegEx tools to separate out venue information.

Before moving on to those, let's look at the Cell Viewer for the Output field again. Here, we can see that the venue information comes after an angled bracket and a comma, and before a period and a number.

In some cases, the characters at the end of our venue string are an angled bracket, followed by the letter a.

I've entered a Regular Expression that captures all data between the possible bookends.

You often come across a few different string formats when pulling data from the web, so you'll probably need to go through a few different iterations of irregular expression before you get it right. Notice that some of the entries in the output from this tool still have extraneous characters.

This is where the second RegEx tool comes in.

Note that this tool is pointed at RegExOut2, the field with our venue information.

The code we use here will remove the excess characters.

If we look at the output, we can see that our tournament name is listed in the column RegExOut1, and the venue name is listed in the column RegExOut1_3.

At this point, I cleaned up the data set with a Select tool, renaming the two new fields Tournament and Venue. As a final step, I organized the date information.

At this point, we have a column for Year and a column for Date, which consists of both month and day. Note that some tournaments last longer than one day. I isolated each part of this date information with two Text to Column tools. The first one extracts the end date, if applicable, while the second separates out the month and start date.

I then used the Select tool to rename these fields as Month, Day start and Day end, again, bringing forward only the relevant fields.

This date information currently isn't in a format Alteryx can read, so I connected a Formula tool to address this.

I first used a Switch function to return the associated number for each month entry.

I then used the DateTime parse formula to create the tournament start day by combining the start day, Month and Year information.

Next, I created a formula that returns the tournament end day. If that Day end field is empty, this formula uses the same DateTime parsed expression as before to return the start day. If the Day end field has a value, then a new DateTime parsed expression combines the Day end field, the Month and the Year.

At this point, I also added a few formulas to clean up the Venue field. First, I entered a formula that replaces ampersands with the word and.

Next, I created formulas that replace the abbreviation S-W-I-Z with Switzerland, and the abbreviation A-U-S-L with Australia.

I also created a formula that copies the Venue field into a new field called Address.

These last four formulas will be helpful going forward as we convert characters to a standard format and combine the Status set with information from the Google Maps API.

As a final step, I used a Data Cleansing tool to remove any leading and trailing white space, and a Select tool to bring forward only the Tournament, Website, Address, Day to start, Day to finish, and Venue fields. This has been a somewhat involved couple of lessons where we've iterated and parsed the main body of text using RegEx expressions.

As you can see, RegEx is extremely useful for parsing third party website data.

However, as mentioned in a previous course, RegEx can be complicated and can require a bit of trial and error. To learn more, be sure to check out the links in the lesson notes below. As we've looked through the data set in this course, you may have noticed that it includes some non-standard characters, such as vowels with accents and umlauts.

In the next lesson, we'll look at how to tidy up these characters so we can more easily manipulate the information downstream.

Contents

My Notes

You can take notes as you view lessons.

Sign in or start a free trial to avail of this feature.

Free Trial

Download our training resources while you learn.

Sign in or start a free trial to avail of this feature.

Free Trial