7. Grouping and Replacing Fields

 
Subtitles Enabled

Sign up for a free trial to access more free content.

Free trial

Overview

Occasionally particular data points are not always uniformly described the same way. In this lesson, we use the group and replace feature to apply uniformity to these data points.

Lesson Notes

Lesson Goal

This goal of this lesson is to apply uniformity to values in a field when there are small discrepancies.

Why group and replace fields

Sometimes text data can seem a lot more variant than it actually is. This is especially true for fields which come from free text entry. Different spellings, typos, abbreviations etc. These can make one value like multiple different values.

For example, a company called TruColor Limited may have many varieties. TruColour Limited, TruColor Ltd, TruColor, etc.

As you can see, all these variants refer to the same company. But a program like Tableau Desktop will read each entry as separate businesses.

We can use Group and Replace function to apply a uniform name to all similar values.

Manual selection

The first method we can use to group and replace fields is Manual Selection. Here the user first selects a value to group by. Then the user selects any fields from the dataset which they’d like to replace with the grouping value. This is the most accurate group and replacement method. However, it is not efficient for large datasets

Pronunciation

This is an automatic method which means it can comb through large datasets. It groups and replaced values based on how close their pronunciation is. This method would be quite adept at pairing “Smith” with “Smyth”. However, there will likely be matching errors. It’s important to check the validity of these matches.

Common characters

This is another automatic method. It groups values on characters they have in common, regardless of order. This can be useful for matching pairs like “John Smith” and “Smith John”. Just like Pronunciation, this can create matching errors. Be sure to check these matches too.

Transcript

In the previous lesson, we used the separate values function to split our addresses into separate fields of like information.

In this lesson, we'll learn how to use the group and replace feature, and use it to standardize our data.

Occasionally our data relies on human input, whenever this occurs it's safe to assume that there will be human error. This is especially true when data is gathered in the form of free text. Free text data will inevitably content spelling mistakes, inconsistent spelling across entries, abbreviations and so on.

For example, there are a wide variety of possible spellings for a company called TruColor. This is problematic for Tableau Desktop because it will treat each of these variants as a separate company.

In our current data set there is one obviously inconsistency in the County field. We have two counties that look very similar, Surrey with an e, and Surrey without an e.

There's no county in the UK spelled S-U-R-R-Y so this is almost certainly a typo. We'll fix this with a group and replace function.

We'll click the arrow on the County card and hover over Group and Replace.

There are four types of group and replace actions.

In this lesson, we'll focus on the first three types. The fourth type, spelling, is a new feature and is not yet fully deployed.

Let's start with Manual Selection.

To the left we must choose the value we're going to group by. This is the value we wish to keep.

Surrey with an e is the correct spelling so we'll select it.

Now on the right we have to choose the values we want to replace.

We'll click on the incorrect spelling for Surrey and click Done.

Now every cell that used to read the incorrect spelling of Surrey will contain the correct spelling.

If we take a quick look at the Town field we can see that there are a lot of values.

There may or may not be inconsistencies here but combing through each value with a manual selection would quickly become a tedious task.

Instead, we'll try the Common Characters option in the group and replace function.

This is a more automated form of group and replace which identifies mismatches for us. It will attempt to guess which values to group and replace based on common characters between those values. For example, Smith John and John Smith would be interpreted as the same value because they share the exact same characters even though the order is different. If we scroll down we can see that values are now tagged with a paperclip, this tag indicates that Tableau Prep has made a grouping based on this value. Let's look at St Austell and what values it replaced.

We can see that the original entries were very similar. Extra spacing, and a period after S-T made up the only differences. In this case, automatic grouping worked quite well.

Like Town, street name also contains lots of values so we'll need to use an automated approach here too. We'll use Common Characters again.

This time, only one grouping was made.

High Port View replaced Higher Port View.

It's unlikely that these are the same value so we'll delete this change in the Changes pane.

Let's try a different automatic method called Pronunciation.

This works by grouping values which have a similar pronunciation.

Let's take a look at the grouping based on 149 Malden Road.

Clearly there's a problem here, Tableau Prep assumed that 134 Malden Road and 149 Malden Road were the same address, this is because Pronunciation ignores numeric characters. We'll quickly undo this by unchecking 134 Malden Road on the right hand side to remove it from the grouping. Let's look for a value with no numbers.

We'll scroll down and select Clifton Road where Road is abbreviated.

Inside this grouping, we can see it has correctly recognized Clifton Road, where Road is spelled out, as the same value.

This highlights an important limitation of the automated group and replace methods, the results must always be checked manually.

To complete this task, we must check each grouping and undo any bad matches.

I'll do this off camera.

Our customer addresses data set is know properly cleaned and well formatted. Let's quickly recap this lesson. First, we addressed the unusual spellings of Surrey in the County field by using the Manual Selection group and replace option. Then when faced with the many values in the Town field, we used Common Characters to automatically detect similar values.

Finally, we used the same approach for the street name field, this failed so we used Pronunciation instead and manually fixed any inaccuracies.

In the next lesson we'll look at how to save our work and output our data.