13. Other RegEx Outputs

 
Overview

Learn to use the RegEx tool to scrape web data and create tokens.

Summary

Replace

  • The Replace option replaces the text that matches an expression with other text
  • There is an expression in the Formula tool called Regex_Replace that also performs this function

Match

  • The Match function returns a 1 or 0 depending on whether the expression matches the corresponding string
  • There is an expression in the Formula tool called Regex_Match that also performs this function

Tokenize

  • The Tokenize function searches for matches on a specific expression, and then parses those matches into separate columns


 

Transcript

In the previous lesson, we used the Regex tool to parse data.

However, other options are available including replace, match and tokenize.

We'll quickly discuss all these options, starting with replace.

While we previously used the Regex tool to separate out certain strings from fields, we can also use it to replace part of the text.

This option is available directly in the Regex tool, and also as a function, Regex_Replace, found in the string section of the formula tool.
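As a rough illustration of the replace idea outside Alteryx, here is a minimal Python sketch using the standard re module. The sample text, the pattern and the mask_digits name are assumptions made for this example and are not part of the lesson.

```python
import re

def mask_digits(text: str) -> str:
    # Replace every run of digits with a single '#' placeholder.
    return re.sub(r"\d+", "#", text)

print(mask_digits("Order 1234 shipped on 2021-05-06"))
# prints: Order # shipped on #-#-#
```

The Formula tool's Regex_Replace plays a similar role inside a workflow: a pattern, and the text to put in its place.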

An example use case would be to confirm that email addresses are formatted correctly.

Using Regex, we can check that these addresses are formatted as text, followed by an at sign, then more text, followed by a period, and then more text.
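The email shape described here (text, an at sign, text, a period, more text) can be sketched in Python as follows. The pattern is a deliberately loose assumption that only checks the overall shape; it is not a full email validator, and the looks_like_email name is hypothetical.

```python
import re

# text @ text . text, with no spaces or extra at signs in any part
EMAIL_SHAPE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def looks_like_email(value: str) -> bool:
    # fullmatch requires the whole string to fit the pattern.
    return EMAIL_SHAPE.fullmatch(value) is not None

for candidate in ["jane.doe@example.com", "jane.doe@example", "no-at-sign.com"]:
    print(candidate, looks_like_email(candidate))
```

Only the first candidate fits the shape; the second has no period after the at sign, and the third has no at sign at all.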

The next output option is match.

The match choice turns a Regex query into a true or false statement.

Again, this is also available in the formula tool as Regex_Match.

A use case example here might be to confirm that product codes are in the correct format, as a filter before onward analysis.
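A minimal Python sketch of match used as a filter might look like this. The product code format (two capital letters, a hyphen, four digits) and the match_flag name are assumptions chosen for illustration; the lesson does not specify a format.

```python
import re

# Assumed format: two capital letters, a hyphen, then four digits.
PRODUCT_CODE = re.compile(r"[A-Z]{2}-\d{4}")

def match_flag(value: str) -> int:
    # Return 1 when the whole value fits the pattern, otherwise 0.
    return 1 if PRODUCT_CODE.fullmatch(value) else 0

codes = ["AB-1234", "ab-1234", "XY-99", "CD-0007"]
valid_codes = [code for code in codes if match_flag(code) == 1]
print(valid_codes)
# prints: ['AB-1234', 'CD-0007']
```

The 1 or 0 result can then drive a filter step, so only well-formed codes move on to later analysis.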

The final output option is called tokenize.

This is used to extract characters of a specific format.

For example, we could use tokenize to return all hashtags from a body of Twitter text.
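To make the idea concrete, here is a small Python sketch that pulls every hashtag out of a piece of text. The sample tweet, the pattern and the tokenize_hashtags name are assumptions for this example only.

```python
import re

def tokenize_hashtags(text: str) -> list:
    # Return every '#' followed by one or more word characters, in order.
    return re.findall(r"#\w+", text)

tweet = "Loving the new release #data #analytics and #RegEx!"
print(tokenize_hashtags(tweet))
# prints: ['#data', '#analytics', '#RegEx']
```

Python returns the matches as a list, whereas the Regex tool's tokenize output places each match in its own column.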

Tokenize is particularly useful for online searches.

For example, let's say we wanted to return all the URLs from a particular website.

We can use the Alteryx download tool in conjunction with the Regex tool to do this.

We'll clear our canvas and deploy a text input tool from the in/out tab on the tools palette.

In the configuration window, notice the small tree digraphic on the top right-hand corner.

If we click on this, we can enter the text input tool name, in this case URL.

In the space below, we'll type in the address of the website we wish to query.

For example: bbc.co.uk

We'll now navigate to the connectors tab on the tool palette, and connect a download tool to our workflow.

In the configuration window, we'll ensure that the field dropdown is pointed to URL.

This tool will query the internet page specified previously.

Next, we'll add a Regex tool to the workflow.

In the field to parse dropdown, we'll select download data.

We'll now enter our Regex code.

Next, we'll change the output method to tokenize.

Finally, we'll add a browse window to the workflow and run the tool.

As we scroll through the results, we can see that the Regex code has returned all URLs from the bbc.co.uk website and put them in individual columns.
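The download-then-tokenize workflow can be approximated in Python as a rough sketch. The lesson does not show the Regex code it uses, so the URL pattern below is an assumption, as are the download_page and tokenize_urls names and the sample HTML.

```python
import re
from typing import List
from urllib.request import urlopen

# Assumed pattern: http or https, then everything up to whitespace,
# a quote, or an angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")

def download_page(url: str) -> str:
    # Fetch the page and decode it as text (the download step).
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")

def tokenize_urls(html: str) -> List[str]:
    # Return every URL-shaped match found in the page (the tokenize step).
    return URL_PATTERN.findall(html)

sample_html = (
    '<a href="https://www.bbc.co.uk/news">News</a> '
    '<img src="http://example.com/logo.png">'
)
print(tokenize_urls(sample_html))
# prints: ['https://www.bbc.co.uk/news', 'http://example.com/logo.png']
```

Against a live site, the two steps would be chained as tokenize_urls(download_page("https://www.bbc.co.uk")). That call is left out of the example so that it runs without a network connection.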

Clearly, understanding regular expressions in more detail opens up a myriad of new possibilities for the data analyst.

Regular expressions are incredibly versatile and can transform unstructured data into rich sources of information.