11. Dissecting a Regular Expression

 
Overview

This lesson walks through a typical regular expression in Alteryx that dissects a UK postcode.

Summary

RegEx for UK Postcodes

  • In this lesson we dissected RegEx code for extracting the Area and District segment of a UK postcode
  • The first step in creating the code is understanding the data that you are searching for
  • The code used in this lesson is: ([A-Z]([0-9]{1,2}[A-Z]{0,1}|[A-Z]([0-9]{1,2}[A-Z]{0,1})))
  • You can run this code through www.regexr.com to get a better understanding of exactly how it will work
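As a rough illustration of what "extracting the Area and District segment" means, the sketch below runs the lesson's expression in Python rather than Alteryx. It assumes this simple pattern behaves the same in Python's re module as in the Alteryx RegEx tool. The helper name area_and_district and the sample postcodes are made up for illustration and are not part of the lesson's data set.

```python
import re

# Pattern copied verbatim from the lesson summary.
PATTERN = re.compile(r"([A-Z]([0-9]{1,2}[A-Z]{0,1}|[A-Z]([0-9]{1,2}[A-Z]{0,1})))")

def area_and_district(postcode: str):
    """Return the leading area-and-district segment, or None if absent.

    Hypothetical helper: re.match anchors at the start of the string,
    so only the first part of the postcode is considered.
    """
    m = PATTERN.match(postcode)
    return m.group(1) if m else None

# Illustrative sample strings only, one per format named in the lesson.
samples = ["B1 1AA", "W1A 1AA", "E14 5AB", "SW1 2AA", "EC1A 1BB"]
extracted = [area_and_district(p) for p in samples]

for full, segment in zip(samples, extracted):
    print(f"{full!r} -> {segment!r}")
```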

Transcript

Continuing with the data set we introduced in the previous lesson, let's go through the postcode format in a bit of detail.

The postcode area and district component comprises between 2 and 4 characters.

In its simplest form, these characters come in the following formats: letter-number such as B1, letter-number-letter such as W1A, letter-number-number such as E14, letter-letter-number such as SW1, and letter-letter-number-letter such as EC1A.
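To make these formats concrete, here is a small Python sketch that reduces each example code to its "shape", writing L for a letter and N for a number. The function name shape is a hypothetical helper introduced only for this illustration.

```python
def shape(code: str) -> str:
    """Map each character to L (letter), N (number) or ? (anything else)."""
    return "".join(
        "L" if ch.isalpha() else "N" if ch.isdigit() else "?"
        for ch in code
    )

# The five example codes mentioned in the transcript.
examples = ["B1", "W1A", "E14", "SW1", "EC1A"]
shapes = {code: shape(code) for code in examples}

for code, s in shapes.items():
    print(code, s)
```

Seeing the shapes side by side shows why the expression needs two alternatives: some codes start with one letter and some with two.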

To capture the area and district component of the postal code, we've developed the regular expression currently visible on the screen.

This may look daunting, so let's run through what each section of this code means.

The first section directs the code to look for a letter between A to Z.

This is followed by a bracket. Inside the bracket, we have 2 sub-expressions separated by a bar symbol.

The bar symbol can be read as "or", like the conditional statements we have looked at previously.

The first sub-expression says: look for a number between 0 and 9 appearing once or twice, followed by a letter from A to Z appearing zero times or one time.

The second sub-expression says: look for a letter from A to Z, followed by a number between 0 and 9 appearing one time or two times, followed by a letter from A to Z appearing zero times or one time.

More simply, the first sub-expression will pull any postal codes in the letter-number, letter-number-number and letter-number-letter formats.

The second sub-expression will pull any postal codes in the letter-letter-number and letter-letter-number-letter formats.
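The split between the two sub-expressions can be checked with a short Python sketch. It assumes Python's re module treats this pattern the same way as the Alteryx RegEx tool, and it relies on the fact that the innermost bracket (group 3) only takes part in a match when the second sub-expression is used. The function name which_branch is hypothetical.

```python
import re

# Pattern copied verbatim from the lesson summary.
PATTERN = re.compile(r"([A-Z]([0-9]{1,2}[A-Z]{0,1}|[A-Z]([0-9]{1,2}[A-Z]{0,1})))")

def which_branch(code: str) -> str:
    """Report which side of the bar symbol matched the whole code."""
    m = PATTERN.fullmatch(code)
    if m is None:
        return "no match"
    # Group 3 sits inside the second sub-expression only.
    return "second" if m.group(3) is not None else "first"

branches = {code: which_branch(code) for code in ["B1", "W1A", "E14", "SW1", "EC1A"]}

for code, branch in branches.items():
    print(code, branch)
```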

This is a confusing topic so there's no harm in taking some time to consider this statement.

For example, you may want to paste it into one of the online regular expression editors, such as regexr.com, then adjust some of the criteria and notice the consequent changes.
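The same kind of experiment can be sketched in Python, again assuming the pattern behaves as it does in Alteryx. Here one criterion is changed, with the number allowed once only instead of once or twice, and the example codes are re-tested. The variable names and the chosen tweak are illustrative and not part of the lesson.

```python
import re

# Pattern copied verbatim from the lesson summary.
ORIGINAL = r"([A-Z]([0-9]{1,2}[A-Z]{0,1}|[A-Z]([0-9]{1,2}[A-Z]{0,1})))"

# Illustrative tweak: allow exactly one number instead of one or two.
TWEAKED = ORIGINAL.replace("{1,2}", "{1}")

codes = ["B1", "W1A", "E14", "SW1", "EC1A"]

def full_matches(pattern: str) -> list:
    """Return the codes that the pattern matches in their entirety."""
    compiled = re.compile(pattern)
    return [c for c in codes if compiled.fullmatch(c)]

before = full_matches(ORIGINAL)
after = full_matches(TWEAKED)

print("original:", before)
print("tweaked: ", after)
```

With the tweak applied, the two-number code E14 is no longer matched in full, which shows what the {1,2} quantifier contributes.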

In the next lesson, we'll apply this expression to our postal code data set.