3. Box Plots
Box plots are a great alternative to bar charts because they show both the average and each individual data point. As a result, you can see how data is distributed within a column or row.
Why use box plots
- Offer much more granularity than aggregated calculations such as the average
- Help you to understand the distribution of data and impact of outliers on aggregates
- Provide you with additional metrics such as whiskers and hinges
Box plot components
- Median: The middle value in a data set
- Hinges: The distance between the hinges represents the middle 50% of data points
- Whiskers: By default, the whiskers extend to the most extreme data points that lie within 1.5 times the interquartile range of the hinges
- Hinges and whiskers can be adjusted to your own measurements if you wish
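The components listed above can be computed by hand. The following is a minimal sketch, not the charting tool's own implementation: the function name `box_plot_summary` and the sample fees are made up for illustration, and the hinges are taken as quartiles using Python's "inclusive" method, so other tools may give slightly different hinge values on the same data.

```python
from statistics import median, quantiles


def box_plot_summary(values, whisker_factor=1.5):
    """Return the median, hinges, whiskers and outliers for a list of numbers."""
    data = sorted(values)
    # Hinges taken as the lower and upper quartiles (assumption: inclusive method).
    lower_hinge, _, upper_hinge = quantiles(data, n=4, method="inclusive")
    iqr = upper_hinge - lower_hinge
    low_fence = lower_hinge - whisker_factor * iqr
    high_fence = upper_hinge + whisker_factor * iqr
    # Each whisker stops at the most extreme data point inside its fence.
    lower_whisker = min(v for v in data if v >= low_fence)
    upper_whisker = max(v for v in data if v <= high_fence)
    return {
        "median": median(data),
        "lower_hinge": lower_hinge,
        "upper_hinge": upper_hinge,
        "lower_whisker": lower_whisker,
        "upper_whisker": upper_whisker,
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Hypothetical fees per client, with one very large client.
fees = [100, 200, 300, 400, 500, 600, 700, 800, 10000]
summary = box_plot_summary(fees)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Changing `whisker_factor` is one way to mimic adjusting the whiskers to your own measurements.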
Box plots are one of the most underrated visualizations in the business and should be used much more often by analysts. I feel the reason they are not more popular is that they require a small amount of statistical knowledge to interpret correctly.
The problem that box plots solve relates to the aggregation of data.
Let's begin by asking a simple question. When do we use box plots? You should consider using a box plot whenever you're plotting the average or median of a particular data column.
In the example shown on screen, I have the average amount billed to a client for three different fee earners. So for the time period in question, we know that Gary Krug billed 1,766 per client, Alexander Blake 1,867, and Caitir Davison the most at 2,039. As valuable as this chart is, we are missing a huge amount of additional information. For example, how many clients does each fee earner have? Do they have a small number of very large clients? Or are all their clients typically around the same size? These questions can be answered by the equivalent box plot while maintaining the average figure that we've graphed here.
Box plots enable you to see the distribution of customers from smallest to largest. On the X axis, I have the fees per client.
When you hover over a box plot, you're given five different values: two whiskers, two hinges and the median.
You're also given each dot, representing an individual customer, ranging from the smallest all the way up to over 10,000, for the following six customers.
When you hover over the box plot for Alexander Blake, the median value represents his middle client if you lined up all his clients and sorted them from largest to smallest. This median value is $535.
The hinges are $199 and $993. They represent the range of the middle 50% of customers and form the two sides of the box.
This box is also referred to as the interquartile range. The interquartile range is a measure of variability within a data set while excluding outliers. A higher interquartile range indicates a bigger difference in size between clients. And as you can see in this example, Caitir and Gary's clients have a much wider difference while Alexander's clients are much more likely to be the same size once you remove his outliers.
The whiskers extend up to 1.5 times the size of the interquartile range beyond each hinge, and points beyond the whiskers are the outliers. For Alexander Blake, you can see that the average value of 1,867 falls well above the upper hinge of $993, so it lies outside the middle 50% of his customers.
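As a rough worked example of the 1.5 times rule, here is the arithmetic using the hinge values and the average quoted for Alexander Blake. The variable names are illustrative, and the sketch assumes the common convention that a whisker can reach at most 1.5 times the interquartile range beyond its hinge.

```python
# Hinge values and average quoted in the lesson for Alexander Blake.
lower_hinge = 199
upper_hinge = 993
average = 1867

# The interquartile range is the width of the box.
iqr = upper_hinge - lower_hinge

# Assumption: whiskers may extend at most 1.5 * IQR beyond each hinge.
reach = 1.5 * iqr
lower_fence = lower_hinge - reach
upper_fence = upper_hinge + reach

# The average sits inside the box only if it lies between the hinges.
average_inside_box = lower_hinge <= average <= upper_hinge

print(f"IQR: {iqr}")
print(f"Whisker limits: {lower_fence} to {upper_fence}")
print(f"Average inside the box: {average_inside_box}")
```

Because fees cannot be negative, the lower whisker would in practice stop at the smallest client rather than at the negative limit. Each whisker is drawn at the most extreme client inside its limit, so the drawn whiskers are usually shorter than these limits.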
The reason is that Alexander has a small number of very large clients that are pushing the average way up beyond the median. In contrast, the two remaining fee earners have a smaller number of very large clients and, as a result, their average values lie within the interquartile range.
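To see why a few very large clients pull the average above the median, here is a small sketch. The fee values are invented for illustration and are not taken from the chart.

```python
from statistics import mean, median

# Hypothetical fees: eight typical clients plus two very large ones.
typical_clients = [150, 200, 350, 500, 550, 700, 900, 1000]
large_clients = [6000, 10500]
all_clients = typical_clients + large_clients

# Without the large clients, the mean and median are close together.
print("Typical only:", mean(typical_clients), median(typical_clients))

# With them, the mean jumps while the median barely moves.
print("All clients: ", mean(all_clients), median(all_clients))
```

The median only depends on the order of the clients, so two extra values at the top shift it by one position, while the mean absorbs their full size.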
If we only plotted the average value per client for these three fee earners, we would be missing out on all of this additional insight that box plots provide.
So how do we build box plots? Well, thankfully, they're pretty easy. Let's start by taking the fee earner and placing it in rows.
And next, I'll take value and place value in the columns.
Now I want to split the value by client, so I place client under detail.
And then I'll go to the show me options and select the box plot.
To make this easier to read, I'm going to flip the axes, swapping the rows and columns. Now I need to apply a filter. So, I'll drag fee earner into filters, select none and then pick my three.
And as you can see, I now have my visualization.
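The data preparation behind these steps can be sketched outside the charting tool as well. This is only an approximation of what the steps above do: it assumes the value is summed per client, and the rows, names and the `selected` filter are hypothetical.

```python
from collections import defaultdict

# Hypothetical billing rows: (fee earner, client, value).
rows = [
    ("Earner A", "Client 1", 300),
    ("Earner A", "Client 1", 200),
    ("Earner A", "Client 2", 9000),
    ("Earner B", "Client 3", 1200),
    ("Earner B", "Client 4", 800),
    ("Earner B", "Client 3", 100),
    ("Earner C", "Client 5", 50),
]

# Filter step: keep only the fee earners we want to compare.
selected = {"Earner A", "Earner B"}

# Detail step: one total per (fee earner, client) pair (assumption: summed).
totals = defaultdict(float)
for fee_earner, client, value in rows:
    if fee_earner in selected:
        totals[(fee_earner, client)] += value

# One sorted list of client totals per fee earner: the dots on each box plot.
points = defaultdict(list)
for (fee_earner, _client), total in totals.items():
    points[fee_earner].append(total)
points = {earner: sorted(values) for earner, values in points.items()}

for earner, values in points.items():
    print(earner, values)
```

Each list in `points` corresponds to one row of dots in the visualization, with one dot per client.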
If I place this sheet on a dashboard, it looks pretty good. However, the dots are still much too small. To make them bigger, I go to size on the marks card and drag to the right.
Let's see how this looks.
That looks much better. I could make them a little bigger. So let's return and go one or two more notches.
And now in my dashboard, I have the box plot complete.
If I want to add in average lines, let's return to the sheet, go to analytics.
I'll add an average line, which can be dragged onto the chart.
I'll change the average line option and make sure it's set to per cell, because when it's set to per cell, it calculates the average for each individual fee earner.
When I now return to my dashboard, you can see the average line is in position.
When you have box plots like this, it's very easy to create some nice scenario analysis. For example, what would be the impact on my average fees per client if I remove the biggest client for each of the three fee earners, bearing in mind that the average ranges between just over 2,000 and about 1,776? So let's remove the biggest client for each one.
So I'll hit exclude, exclude, and one more time.
And as you can see, all the averages drop. Now Alexander Blake is 1,551, Caitir is 1,441 and Gary Krug is 1,571.
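The same scenario can be checked with a few lines of code. This is a minimal sketch with invented fee values and a hypothetical helper named `average_without_largest`; it is not the data behind the dashboard.

```python
from statistics import mean

# Hypothetical fees per client for two fee earners.
fees_by_earner = {
    "Earner A": [200, 400, 600, 8800],
    "Earner B": [500, 1500, 2500, 3500],
}


def average_without_largest(values):
    """Return the mean after dropping the single largest value."""
    if len(values) < 2:
        raise ValueError("need at least two values to drop one")
    return mean(sorted(values)[:-1])


results = {}
for earner, fees in fees_by_earner.items():
    before = mean(fees)
    after = average_without_largest(fees)
    results[earner] = (before, after)
    print(f"{earner}: average {before} -> {after} without the largest client")
```

The fee earner whose largest client is furthest from the rest sees the biggest drop, which is the effect the box plot makes visible before you exclude anything.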
The next time you're asked to create a chart that shows the average fees per client or some other similar metric, try to convince your manager to use a box plot instead, so you can get a much more insightful visualization for your analysis.