4. Understanding Joins in Tableau
In this lesson, we examine the performance improvements that joins can provide. We will also investigate some problems you might encounter and how these can be detected and fixed.
Performance improvements from joins
- The performance recorder shows that a joined dataset creates fewer tasks for Tableau
- As a consequence, performance is often improved by joining datasets
- However, joining two larger datasets prior to aggregation may result in slow loading times
Errors that can happen in joins
- The biggest source of errors in joining datasets is duplication
- Duplication occurs if the join is on a column where one data source does not contain unique values
- As a sense check, use the row counter in Tableau when performing joins
- This will tell you if duplication is happening
In the previous lesson, we learned how to join different data sets together in Tableau. Let's now check out what performance improvement, if any, this join creates. I'm going to go to my sheet, select Start Performance Recording, and then refresh my extracts.
I'll enter my password and this process begins.
Once it's finished, I'll Stop Performance Recording.
And my performance sheet appears in Tableau, after a little bit of loading.
In this particular example, you can see that we have three events: generating the extract on the refresh, connecting to the data source, and executing a query from my simple chart.
When we have joins, generating the extract can take quite a long time, in this case over five seconds. But then executing the query and connecting to the data source are both pretty fast actions. Off-camera, I'll show you my results for the query of both datasets when not joined.
As you can see from the results, we need more than double the number of events to get the same answer, because we need to connect to two different data sources, execute two queries, and blend the data. Typically, generating the extract won't see a big difference between the two options, but connecting to the data source and executing the query show a marked difference between the two alternatives. As a result, I'd often use joining data as a way of improving query performance. I'll just switch between the two again.
Returning to our dataset, I'm going to look at joins in more detail in this lesson. And in particular, focus on some problems you may encounter with duplicate data.
In the current example, I have 11,982 rows in my joined table.
If I look at my transaction data table, I'll notice that if I increase the number of rows shown to, say, 20,000, it tells me I have the exact same number in the transactions table.
And when I look at the "Fee earner" by division table, I have 21 rows. So this join makes sense. There are 11,982 transactions, so there should be 11,982 rows in my joined table.
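Tableau performs this join internally, but the same sanity check can be sketched outside Tableau. Here is a hypothetical pandas example (pandas, the toy tables, and their values are my own illustration, not part of the course data) showing that a join on a key that is unique in the lookup table preserves the transaction row count:

```python
import pandas as pd

# Hypothetical miniature versions of the two tables in the lesson.
transactions = pd.DataFrame({
    "Fee earner": ["Ann", "Bob", "Ann", "Cat", "Bob"],
    "Amount": [100, 250, 75, 300, 50],
})
divisions = pd.DataFrame({
    "Fee earner": ["Ann", "Bob", "Cat"],  # unique join key
    "Division": ["Tax", "Audit", "Legal"],
})

joined = transactions.merge(divisions, on="Fee earner", how="left")

# Because "Fee earner" is unique in the divisions table,
# each transaction matches exactly one division row.
print(len(transactions), len(joined))  # 5 5
```

The row counter in Tableau plays the same role as comparing these two lengths: if they match, no duplication has occurred.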
The reason this works so well is that the "Fee earner" by division table contains a unique column called "Fee earner".
Now, if we open up this data set, we can see that the "Fee earner" is unique. As a result, it's very easy to join across this column. However, if I try to join across a different column that's not unique, for example "Budget rate per hour", then I may run into some problems.
Let's see what happens by going to my join and switching "Fee earner" to "Budget rate per hour".
And when Tableau updates, you can now see that I have many more than my 11,982 transactions in my table. In actual fact, I have 20,023. This is because "Budget rate per hour" is not a unique identifier in my "Fee earner" by division data table, and it's creating problems in the join.
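The row explosion seen here can be reproduced in miniature. Below is a hypothetical pandas sketch (again, the tables and values are invented for illustration) where joining on a non-unique rate column duplicates rows, just as it did in the Tableau table:

```python
import pandas as pd

transactions = pd.DataFrame({
    "Fee earner": ["Ann", "Bob", "Cat"],
    "Budget rate per hour": [200, 150, 200],
})
divisions = pd.DataFrame({
    "Fee earner": ["Ann", "Bob", "Cat"],
    "Budget rate per hour": [200, 150, 200],  # 200 appears twice: not unique
    "Division": ["Tax", "Audit", "Legal"],
})

# Joining on the non-unique rate column: every transaction with
# rate 200 matches TWO division rows, so rows are duplicated.
bad_join = transactions.merge(divisions, on="Budget rate per hour")
print(len(transactions), len(bad_join))  # 3 5
```

Comparing the before-and-after row counts, exactly as the lesson recommends, is what exposes the problem.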
So whenever you perform a join on a dataset, always be sure to check the number of rows in the combined table, so that it tallies with your expectation. Sometimes, you may need to join across two columns. For example, if "Fee earner" was split into two columns, first name and surname, then I would need to join on both columns to make sure that I had the correct "Fee earner". To join on multiple columns, you simply add a new join clause and select the columns you wish to include. In this case, I'll include "Fee earner" just to fix my current join.
And when I do this, I return to the original number of rows.
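The multi-column join described above can be sketched the same way. In this hypothetical pandas example (invented names and tables, not course data), first name alone is ambiguous, but the pair of columns identifies each person uniquely:

```python
import pandas as pd

transactions = pd.DataFrame({
    "First name": ["Ann", "Ann", "Bob"],
    "Surname": ["Smith", "Jones", "Smith"],
    "Amount": [100, 250, 75],
})
divisions = pd.DataFrame({
    "First name": ["Ann", "Ann", "Bob"],
    "Surname": ["Smith", "Jones", "Smith"],
    "Division": ["Tax", "Audit", "Legal"],
})

# Joining on "First name" alone would match each Ann twice;
# joining on both columns keeps one row per transaction.
joined = transactions.merge(divisions, on=["First name", "Surname"])
print(len(joined))  # 3
```

Adding a second join clause in Tableau is the GUI equivalent of passing both column names to the join.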
These issues regarding duplication often arise from differences in granularity between data sets. For example, combining daily transaction data with monthly quotas, can create havoc during the joining process. If you are having difficulty implementing a join with different levels of data granularity, it may be worth exploring how to blend data using relationships, which I'll show you in the next lesson.
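To make the granularity trap concrete, here is one more hypothetical pandas sketch (toy figures of my own, not course data): joining daily transactions to monthly quotas repeats each month's quota once per daily row, so aggregating the quota after the join overstates it.

```python
import pandas as pd

daily = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-02"]),
    "Sales": [100, 150, 80],
})
monthly_quota = pd.DataFrame({
    "Month": pd.to_datetime(["2024-01-01", "2024-02-01"]).to_period("M"),
    "Quota": [400, 300],
})

# Join each daily row to its month's quota. The quota value is
# repeated for every day in that month, so summing "Quota" after
# the join double-counts it: the classic granularity trap.
daily["Month"] = daily["Date"].dt.to_period("M")
joined = daily.merge(monthly_quota, on="Month")
print(joined["Quota"].sum())  # 1100, not the true total of 700
```

Relationships, covered in the next lesson, are one way to combine tables at different granularities without this inflation.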