8. X-Y Scatter Plots and Trendlines

Overview

Scatter plots are the ideal option when you want to display a full dataset on a chart and identify trends within the dataset. PowerPoint's internal tools also help you create regression trend lines and equations for your data.

To explore more Kubicle data literacy subjects, please refer to our full library.

Summary

  1. Lesson Goal (00:00)

    The goal of this lesson is to create an XY scatter plot, and add a trendline to this scatter plot.

  2. Uses of a Scatter Plot (00:04)

    An X-Y scatter plot, or simply a scatter plot, is useful when we want to find trends in data or divide data into clusters. The primary business situations where you would use a scatter plot is when you have a large number of data points, and want to deduce a pattern or divide the points into segments. For example, we use a scatter plot to identify the relationship between the annual car insurance premium and the age of 100 customers of an insurance company.

  3. Creating a Scatter Plot (00:54)

    The scatter plot is simply a series of points. Each point is defined by its x and y coordinates. To create a scatter plot, we select the Insert tab, then Chart, and then Scatter. The scatter plot contains two columns of data, representing the x-axis values and y-axis values of each point on the plot.

    Scatter plots can be formatted in the usual way. It can often be a good idea to remove the horizontal and vertical lines from the chart by selecting them and pressing Delete. You should add labels to the x-axis and y-axis. In our case, we also format the x-axis and y-axis values to remove any decimal places.

    Analyzing the scatter plot provides insights into the relationship between the two variables on your chart. For example, on our chart we can see that the insurance premium is higher for younger customers. We can also see that premium amounts vary more for older customers than for younger customers. 

  4. Adding a Trendline (02:19)

    You can obtain insights from a scatter plot by observation, but you can obtain more accurate insight by adding a trendline. A trendline is a regression line that tries to measure the relationship between the variables on the scatter plot.

    To add a trendlines, we right click a point on the plot, and select Add Trendline. This opens the Format Trendline pane. We can then select the type of trendline under Trendline Options. The most common line type is a linear trendline. The Format Trendline pane also allows us to display a regression equation and R-squared value on the scatter plot. It’s a good idea to add these to the scatter plot and place them in a visible location on the plot. The Format Trendline pane also allows us to modify properties of the line such as color and thickness.

  5. Interpreting the Trendline (03:35)

    The equation of the trendline can be used to predict future y-values (in our case insurance premium) for a given x-value (in our case age). The R-squared value ranges from 0 to 1, and indicates how accurately the regression line predicts the current data. Generally, we need to use more than one variable to get a high value (such as 0.9 or greater) for the R-squared value.

Transcript

X Y scatter plots tend to be used a lot in academia, primarily to show the results of experiments, and to try and deduce a pattern from these results. They are also used a lot in engineering and operations when measuring metrics such as mean time between failures. For business related tasks, I find that I use the X Y scatter plot a lot less than the bar chart or line chart. But it is very useful for one particular task. And that is when I have a lot of data points and I want to deduce a pattern in the data or group the data into different segments. To show the X Y scatter plot in action, let's take the sample case of an insurance company that has 100 customers who pay an annual payment every year. I'd like to see if a relationship exists between the annual premium and the age of the individual customers. So I'll start by creating the X Y scatter plot and off camera, I'll paste in my data. With the age being on the x-axis, and the premium amount on the y-axis.

I'll now minimize the Excel sheet, and begin to format our chart.

I'll start by extending its width.

And I'll also format the y-axis. So right click, and format axis.

I'll go the number, and remove the decimal places which I don't really need here.

I'll also remove the horizontal lines, and off camera I'll add labels to the x and y axis using text boxes.

Now let's take a look at some of the insights from our data. The insurance company in question only provides car insurance. And in the top left hand corner, we can see that young drivers pay a very large premium for this insurance. As drivers get older, the premiums decrease but also, the premium begins to vary a lot more within this age group. This could be due to the different types of car that people drive, and the number of accidents that they've had in previous years. From the scatter plot, we can discern that young people pay more for their car insurance than older people. But what is the nature of the relationship between age and the insurance premium, and can we quantify it? Well, here's where the X Y scatter comes into its own, as it has some additional tools for answering these questions. We'll start by clicking on one of the points, right clicking, and add trend line.

A trend line will try to find a regression line to quantify the relationship between age and the insurance premium. From viewing our data, I think the relationship is likely to be linear, so I'll leave the regression type at this option.

I'll also select display equation, and display R-squared value, which I'll explain shortly.

I'll also change the line color to red.

And a thickness of, say, two points to make it easier to read. I'll then close, and we can see that a trend line has been added to our chart. I'll be sure to move the equation and the R-squared value up to the top right so that it's readable.

The R-squared value determines how accurate our regression line and our equation are. If R-squared was equal to one, it would mean that our equation would perfectly predict the premium amount for a given age. With our current equation, we can now calculate the premium amount y, for given age x. But the equation will contain some errors. To improve the accuracy of this equation, an increased R-squared to say above 0.9, we might need to add other variables to the equation, such as location, car size, and driving history. If you're interested in learning more about regression, be sure to check out my upcoming course on statistics in Excel. For now, I'll wrap up this slide by adding a slide title and a chart title off camera.

As an exercise, try to remove the negative values from the y-axis, and also start the x-axis at 10 instead of 0.

This should serve to improve the formatting of your chart. I'll leave my answer in the after file below the video.

PowerPoint Essentials
Construct Charts for Your Data

Contents

My Notes

You can take notes as you view lessons.

Sign in or start a free trial to avail of this feature.

Free Trial

Download our training resources while you learn.

Sign in or start a free trial to avail of this feature.

Free Trial