# 9. Dissecting a Regular Expression

Overview

In this lesson we dissect a regular expression that parses a UK postal code.

Summary

1. Lesson Goal (00:10)

In this lesson, we’ll learn how Regular Expressions are constructed and how they operate.

2. Understanding the Format of the Target Field (00:19)

In order to create the correct RegEx code for your specific application, you first need to understand the format of the target field.

In our example, we’ll break down the format of a UK postal code. The area and district portion of the postal code consists of between 2 and 4 characters, in one of the following formats:

• Letter, Number
• Letter, Number, Letter
• Letter, Number, Number
• Letter, Letter, Number
• Letter, Letter, Number, Letter
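
One way to get a feel for these formats is to reduce a code to its letter/number "shape" and compare it with the five shapes listed. The following is a minimal Python sketch; the names `LISTED_SHAPES`, `shape`, and `is_listed_shape` are made up for illustration and are not part of the lesson's tooling.

```python
# The five area-and-district shapes listed above, written as
# L (letter) and N (number).
LISTED_SHAPES = {"LN", "LNL", "LNN", "LLN", "LLNL"}


def shape(code: str) -> str:
    """Map each character to L (letter A-Z), N (digit 0-9) or ? (anything else)."""
    symbols = []
    for ch in code.upper():
        if "A" <= ch <= "Z":
            symbols.append("L")
        elif "0" <= ch <= "9":
            symbols.append("N")
        else:
            symbols.append("?")
    return "".join(symbols)


def is_listed_shape(code: str) -> bool:
    """Return True if the code has one of the five listed shapes."""
    return shape(code) in LISTED_SHAPES


if __name__ == "__main__":
    for code in ["B1", "W1A", "E14", "SW1", "EC1A", "1AB"]:
        print(code, shape(code), is_listed_shape(code))
```

This only checks the shape of a code; it says nothing about whether the area or district actually exists.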

3. Developing the Code (01:06)

This code will capture the area and district information:

([A-Z]([0-9]{1,2}[A-Z]{0,1}|[A-Z]([0-9]{1,2}[A-Z]{0,1})))

Let’s break this down into constituent parts:

• [A-Z] will search for a letter between A and Z
• The | acts as an “OR” operator that separates two subexpressions.
• The first subexpression, [0-9]{1,2}[A-Z]{0,1}, will search for a number between 0 and 9 appearing once or twice followed by a letter between A and Z appearing zero times or one time.
• The second subexpression, [A-Z]([0-9]{1,2}[A-Z]{0,1}), will look for a letter between A and Z, followed by a number between 0 and 9 appearing one time or two times, followed by a letter between A and Z appearing zero times or one time.
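
As a rough illustration, the expression can be tried out with Python's built-in `re` module. This is only a sketch: the function name `area_and_district` and the sample strings are assumptions chosen for the example, and the lesson itself may apply the expression in a different tool.

```python
import re

# The expression from this lesson, unchanged.
PATTERN = re.compile(r"([A-Z]([0-9]{1,2}[A-Z]{0,1}|[A-Z]([0-9]{1,2}[A-Z]{0,1})))")


def area_and_district(text: str):
    """Return the area-and-district part at the start of text, or None."""
    # re.match only looks for a match at the beginning of the string.
    m = PATTERN.match(text)
    # Group 1 is the outermost pair of parentheses, i.e. the whole match.
    return m.group(1) if m else None


if __name__ == "__main__":
    # Illustrative inputs: an area-and-district code followed by more text.
    for text in ["B1 1AA", "W1A 0AX", "E14 9XX", "SW1 2AA", "EC1A 1BB", "123"]:
        print(repr(text), "->", area_and_district(text))
```

Because the pattern uses only upper-case ranges, lower-case input such as "ec1a" would not match unless the text is upper-cased first.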

Transcript

In the previous lesson, we introduced the concept of regular expressions.

In this lesson, we'll go into a little more detail about how regular expressions are constructed and how they operate.

Let's go through the postcode format in a bit of detail.

The postcode area and district component comprises between two and four characters.

In its simplest form, these characters come in the following formats: letter, number, such as B1; letter, number, letter, such as W1A; letter, number, number, such as E14; letter, letter, number, such as SW1; and letter, letter, number, letter, such as EC1A.

To capture the area and district component of the postal code, we've developed the regular expression currently visible on the screen.

This may look daunting, so let's run through what each section of this code means.

The first section directs the code to look for a letter from A to Z.

This is followed by a parenthesis. Inside the parenthesis, we have two sub-expressions separated by a bar symbol. The bar symbol can be read as or, like in the conditional statements we've looked at previously.

The first sub-expression says to look for a number between zero and nine appearing once or twice, followed by a letter from A to Z appearing zero times or one time.

The second sub-expression says to look for a letter from A to Z, followed by a number between zero and nine appearing one time or two times, followed by a letter from A to Z appearing no times or one time.

More simply, the first sub-expression will pull any postal codes in the letter, number; letter, number, number; and letter, number, letter formats.

The second sub-expression will pull any postal codes in the letter, letter, number, and letter, letter, number, letter formats.
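
To see which sub-expression accounts for which format, each branch can be written out as its own pattern, with the leading [A-Z] attached, and tested against a whole code. This is a sketch in Python; the names `FIRST`, `SECOND`, and `which_branch` are hypothetical.

```python
import re

# Leading letter + first sub-expression.
FIRST = re.compile(r"[A-Z][0-9]{1,2}[A-Z]{0,1}")
# Leading letter + second sub-expression.
SECOND = re.compile(r"[A-Z][A-Z][0-9]{1,2}[A-Z]{0,1}")


def which_branch(code: str):
    """Return 'first' or 'second' depending on which branch matches the whole code."""
    # fullmatch requires the entire string to match, not just a prefix.
    if FIRST.fullmatch(code):
        return "first"
    if SECOND.fullmatch(code):
        return "second"
    return None


if __name__ == "__main__":
    for code in ["B1", "W1A", "E14", "SW1", "EC1A"]:
        print(code, "->", which_branch(code))
```

Note that the quantifiers allow slightly more than the five formats named above: a string such as "E14A" (letter, number, number, letter) satisfies the first branch, and "SW11" (letter, letter, number, number) satisfies the second.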

This is a confusing topic, so there's no harm in taking some time to consider this statement.

For example, you may want to paste it into one of the online regular expression editors, such as regexr.com, adjust some of the criteria, and notice the consequent changes. We'll stop the lesson here.
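
The same kind of experiment can be run outside an online editor. The sketch below, in Python, changes one criterion (the digit quantifier {1,2} becomes {1}) and compares the result with the original; `ADJUSTED` and `first_match` are names invented for this example.

```python
import re

# The expression from this lesson.
ORIGINAL = r"([A-Z]([0-9]{1,2}[A-Z]{0,1}|[A-Z]([0-9]{1,2}[A-Z]{0,1})))"
# Experimental variant: allow exactly one digit instead of one or two.
ADJUSTED = ORIGINAL.replace("{1,2}", "{1}")


def first_match(pattern: str, text: str):
    """Return what the pattern captures at the start of text, or None."""
    m = re.match(pattern, text)
    return m.group(1) if m else None


if __name__ == "__main__":
    for code in ["E14", "SW1"]:
        print(code, first_match(ORIGINAL, code), first_match(ADJUSTED, code))
```

With the adjusted pattern, "E14" is captured only as "E1", because the second digit is no longer allowed, while "SW1" is unaffected.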

In the next lesson, we'll apply this expression to our postal code data set.
