Sign in or start a free trial to avail of this feature.
1. Adding Trend Lines to Data
For some data sets, we may want to create a model to establish relationships between different variables. In this lesson, we'll go through how to do this with Tableau's trend lines feature.
To explore more Kubicle data literacy subjects, please refer to our full library.
- Trend lines can establish relationships between different variables in your dataset
- Trend lines tend to work best with x-y scatter plots - The strength of the relationship is determined by the r-squared value
- The closer r-squared is to 1, the stronger the relationship
4 trend lines available in Tableau
- Linear - Logarithmic - Exponential - Polynomial
In previous lessons, we've used the analytics tab to generate average lines, constant lines, and even distribution bands. However, we haven't looked at trying to create a model from the data using this part of the tab.
We'll begin our analysis by looking at trend lines. Trend lines are used to see if we can establish a relationship between the variables shown on the chart. In this case, between sales and quantity.
The relationship between the variables can be of four different types.
When I drag the trend line, I can see there may be a linear, logarithmic, exponential, or polynomial relationship.
Let's start off with linear.
When I drag the linear trend line onto the chart, it creates a large gray line that represents the best fit, given all of the data points included.
The quality of the fit is determined by a value called R-Squared, which I can see when I hover over the trend line. The closer R-Squared is to one, the more accurate my trend line is at predicting sales given quantity, using the equation shown on screen. In this example, my R-Squared is 0.896, which is quite high. And so in a linear model, the total quantity sold is a reasonably good estimate for predicting sales.
With this model, I can now take a quantity and then using this equation, calculate sales, with a reasonable level of predictability.
Let's now try another model, say logarithmic. So I'll drag my trend off the line, and then put in logarithmic. When I now hover over logarithmic, I can see that R-Squared is much lower than for linear. And so the linear model is a better option than logarithmic, and from my data, represents a better model for the relationship between quantity and sales.
So let's go back to linear.
Around the trend line, you might've noticed two adjacent lines which represent a 95% confidence band.
The only time you'll need to use the confidence bands is when your data represents a sample, and is not the total data population. Say that this data was the sample of a much larger data set.
The confidence band indicates a 95% level of confidence that the best fit trend line for the total population exists within this interval. Obviously, the wider apart these lines, the less likely your calculation from the sample represents the best trend line for the total population. When your dataset represents the total population, you can simply remove the lines. By going to Edit, and Show Confidence Bands removed, and press Okay.
And this now leaves you with your trend line. Sometimes you'll want to include multiple trend lines on a particular chart. To do this, first add a color code.
And in this example, I'm going to color code by location. And in particular, town.
Unfortunately, there are simply too many towns in my data set for this to make any sense. So let's remove town, and instead, I'll go with county.
And with county, I now have five different trend lines representing my data.
Unfortunately, three of these counties have very few data points. So you can see, that they simply have horizontal lines in the chart. However, if I remove these lines from the chart, by right clicking on the legend and excluding, I can now see I have two different trend lines for the two different counties. And so when I'm estimating sales from quantity, my equation for Surrey is different to my equation for Cornwall. And using this analysis, we can see different relationships emerging for different variables. In the next lesson, we'll go back to the analytics tab and explore clustering and forecasting.