Sign in or start a free trial to avail of this feature.
2. Inner Joins
In this lesson, we learn how to use Alteryx's join tool to bring together records horizontally using an INNER JOIN.
To explore more Kubicle data literacy subjects, please refer to our full library.
- The Join tool can only accept two input
- Data is merged based on matches in one or fields as specified by the user
- The Join tool will produce three output datasets
- The L and R outputs consist of entries that are unique to the correlating input datasets
- The J output consists of entries that in both the L and R datasets. This is known as the Inner Join
- Connection Progress contains information on the number of rows moving forward through a workflow as well as the file size
- Connection Progress boxes only appear for momentarily by default
In the previous lesson, we introduced the union join which helped us to combine data sets top to bottom by lining up identical fields.
In this lesson, we'll consider how to combine two data sets horizontally using a shared or common field.
We'll continue using the corporate events data set introduced in the last lesson.
By looking at our merge cells data, we can see that customers are linked to a specific account number.
We would like to know further details about these customers, but this information is in a separate file-the customer data set.
We'll bring a new input icon onto the canvas and connect this to the customer data set.
Using the preview section of the configuration window, we can see that the field customer number is also contained in the customer data set.
Given that there is a common field between the sales data set and the customer data set, we can combine these two files using the join icon.
We'll go to the join tab on a tools palette and bring the join icon onto the canvas.
We'll now connect the complete sales data set to the input note labeled L on the top left.
We'll then bring in the customer data set by connecting it to the bottom left node labeled R.
Lets now navigate to the configuration window, so we can specify fields used for merging data.
In this case, we are using customer number.
Let's take a minute to look at the three Venn diagrams presented in the configuration window.
These diagrams graphically explain the output from the join data.
Data that's common to both sets is known as an inner join and is signified by the J icon.
Fields that are specific to one set or the other are signified by the L and R diagrams.
We'll run the tool to see what the resulting data sets look like in practice.
We'll now click on the middle output note labeled J or the inner join.
This data set consists of customer numbers common to both inputs.
Lets now click on the bottom output note labeled R.
This data set contains customer numbers from the customer data set but not from the sales data set.
In other words, these are customers of the business who did not make a purchase in the 3 months under review.
Finally, we'll click on the top output node labeled L.
The output window should show us information on all transactions in the first three months made by customers who aren't listed in the customer data set.
As expected, this data set only contains null values as all customers have been correctly accounted for in the master customer data set.
It should be noted that this result will be somewhat dependent on the data sets being accessed.
For example, if you are connecting to a database in the live environment, it is possible that new customers could have recorded a transaction before their information is reconciled with a customer database.
It's just something to bear in mind.
The join tool is very useful in filtering out and merging data.
Using it can be a bit complicated so let's run through it again.
The join tool requires two input data sets.
Users must then select at least one field from each input to compare.
Any record that has an identical match in the fields being compared will be displayed in the J output node.
Any record that is unique to the L input node will be displayed in the L output node.
And any record that is unique to the R input node will be displayed in the R output node.
In the next lesson, we'll continue with this topic by looking at what we can do with the data that is separated and combined using this tool.