4. Creating Databins
Creating databins (or a frequency distribution) is another way of understanding how your data is distributed.
- Databins help you to visually interpret the distribution of your data
- Use max and min values in your dataset to determine how big your databins should be
=FREQUENCY(data_array,bins_array): Array formula that calculates frequency distribution
--- data_array: This is the full dataset
--- bins_array: This is the list of bins
Note: You must pre-select the array of output cells before writing the formula
CTRL + →: Move to end of data region
CTRL + SHIFT + →: Select all cells within data region
SHIFT + →: Select adjacent cell
ALT + E, S, F: Paste formulas
ALT + E, S, T: Paste formats
In the last lesson, we used percentiles to get a sense of how data is distributed. Another common method of doing this is grouping our data into bins and then plotting this data on a histogram. This is called a frequency distribution and can provide some interesting insights into how data is dispersed. Let's first decide on what size our bins should be. We'll first find the max value in the data set.
Close the bracket and press enter. And this gives us a revenue of just under $15,000. Seven to eight columns is plenty for a histogram so let's create bins that are $2,000 big. I'll start off with 2,000 and I'll simply add $2,000 to this cell, and then copy for the remaining values.
Alt + E, S, F to paste the formula. I'll then format these cells by pasting the format from our revenue column. Alt + E, S, T to paste formats.
Once we have the bin size decided, it's time to calculate the number of days that should fit within each bin. While we could do this using the COUNTIF statement, it's much easier to use an in-built array function in Excel called FREQUENCY. First select an array of cells that's one greater than our list of bins.
In the first cell, we'll then type =FREQUENCY and the data_array will simply be our revenues.
And the bins_array will be the bins we've just created and also include the label.
We'll then close the bracket and press Control + Shift + Enter to create the array formula. This formula calculates the number of days that belong in each revenue bin. As you can see from the table of data, by far the most days lie in the 2,000 to 4,000 range with a few big days lying in the $12,000 plus range. We also have two days that are greater than 14,000, so underneath 14,000, I'll simply write "More." Plot this frequency distribution as a histogram, we could simply create a bar chart in Excel, or we could use the in-built histogram function in the Analysis ToolPak. For the sake of variety, let's check out the ToolPak. So in the data tab in the Ribbon, simply select Data Analysis and then select Histogram.
The Input Range is going to be the revenue column, the Bin Range is going to be the bins, and for the Output Range, we simply select a cell that would correspond to the top-left-hand corner of our output, and N2 should be fine for this. I'll also select the Chart Output and press OK.
And as you can see, the Analysis ToolPak has created the same frequency distribution we created earlier and a chart that shows our values.
Often histograms like this represent a faster way of understanding a data distribution than interpreting the full dataset or trying to get your head around percentiles. Frequency distributions are especially useful when the total number of data points within a specific range is more important than the actual data points themselves. Percentiles and frequency distributions are great for understanding how our data is dispersed, but it would be great if we had some metrics that could quantify the level of dispersion. Two related metrics, variance and standard deviation, perform this task in statistics, and we'll learn how to calculate these metrics in the next lesson.