Sign in or start a free trial to avail of this feature.
4. Creating Databins
Creating databins (or a frequency distribution) is another way of understanding how your data is distributed.
Data Bins (00:04)
Data bins divide a data set into groups according to its values. They can be used to understand how a data set is distributed. When creating bins, we need to decide how many bins to create, and what size they should be.
For our sports bar, the highest daily revenue amount is just under $15,000. As a result, we create bins with a size of $2,000. This means the first bin contains all days with a revenue between $0 and $2,000, the second bin contains all days with a revenue between $2,000 and $4,000, and so on. This means we’ll have a total of 8 bins.
Frequency Distribution (01:10)
A frequency distribution tells us the number of values that are contained in each bin. In Excel, we can create a frequency distribution using the function FREQUENCY. This is an array formula which takes two arguments. First is the data set which we want to find the distribution of. Second is the bins to use to divide the data. The function then returns the number of values in the data set that fall within each bin.
A histogram is a chart that represents a frequency distribution visually. We can create a histogram using the Analysis ToolPak. To do this, we select the Data Analysis option from the Data tab on the ribbon, and select a histogram as the analysis tool we want.
When creating a histogram, the input range is the data whose distribution we want to study, such as the daily revenues for the sports bar. The bin range is the bins we are using to divide the data. The default output of the Histogram tool is a frequency distribution table, but we can specify a chart output as well. This chart shows us how the data is distributed between the bins we defined.
In the last lesson, we used presenters to get a sense of how data is distributed. Another common method of doing this is grouping your data into bins and then plotting this data on a histogram. This is called a frequency distribution, and can provide some interesting insights into how data is dispersed. In this lesson, we'll create a frequency distribution for our sports bar data. Let's first decide on what size our bins should be. We'll first the max value in the data set.
Close the bracket and press enter. And this gives us a revenue of just under $15,000. Seven to eight columns is plenty for a histogram. Let's create bins that are $2,000 big. I'll start off with $2,000 and I'll simply add $2,000 to this cell.
And I copy for the remaining values.
Alt E-S-F to paste the formula.
I'll then format these cells by pasting the format from our revenue column.
Alt E-S-T to paste formats.
Once we have the bin size decided, it's time to calculate the number of days that should fit within each bin. While we could do this using the count if statement, its to use an inbuilt array function in Excel, called frequency. First, select an array of cells that's one greater than our list of bins.
In the first cell, we'll then type equals frequency, and the data array will simply be our revenues. And the bin's array, will be the bins we've just created, and also include the label.
We'll then close the bracket and press Control Shift Enter to create the array formula. This formula calculates the number of days that belong in each revenue bin. As you can see from the table of data, by far the most days lie in the 2,000 to 4,000 range with a few big days lying in the $12,000 plus range.
We also have two days that are greater than 14,000. So, underneath 14,000, I'll simply write "more." To plot this frequency distribution as a histogram, we could simply create a bar chart in Excel or we could use the inbuilt histogram function in the analysis tool pack. For the sake of variety, let's check out the tool pack. So, on the data tab in the ribbon, simply select data analysis and then select histogram.
The input range is going to be the revenue column. The bin range is going to be the bins. And for the output range, we simply select a cell that will correspond to the top left hand corner of our output. And N2 should be fine for this.
I'll also select the chart output and press OK.
And as you can see, the analysis tool pack has created the same frequency distribution we created earlier and a chart that shows our values.
Often, histograms like this represent a faster way of understanding a data distribution than interpreting the full data set or trying to get your head around percentiles. Frequency distributions are especially useful when the total number of data points within a specific range, is more important the actual data points themselves. Percentiles and frequency distributions are great for understanding how our data is dispersed, but it would be great if we had some metrics that could quantify the level of dispersion. Two related metrics, variance and standard deviation, perform this task in statistics. And we'll learn how to calculate these metrics in the next lesson.