2. Simple Linear Regression


This lesson explains regression analysis and the basic principles associated with it. It also explains simple linear regression, the most basic form of regression, using an intuitive example.



  1. Lesson Goal (00:16)

    The goal of this lesson is to learn about regression analysis and simple linear regression.

  2. Regression Example (00:28)

    In this lesson, we consider the relationship between temperature and ice cream sales. We want to see if temperature influences ice cream sales. We call ice cream sales the dependent variable, or the outcome variable. We call temperature the independent variable, or the predictor variable.

    Linear regression is a model where we assume that a linear equation can explain the relationship between the outcome variable and the predictor variables. Simple linear regression is a linear regression model with only one predictor variable.

  3. Regression Model (02:52)

    Statistical software finds the fitted regression line, that is, the line that best fits your data set. The output of a linear regression model is the equation of this line.

    We can then use this equation to predict ice cream sales for any day where we know the temperature.

  4. Evaluating the Model (06:00)

    We can measure how closely the regression line fits our data using the coefficient of determination, better known as R-Squared. This has a value from 0 to 1, where a higher value indicates that the fitted regression line is a closer fit to the data.


In the next few lessons, we'll introduce the concept of regression analysis.

Regression analysis is a statistical technique used for analyzing the relationship between variables in a data set.

In this lesson, we'll learn about regression analysis and simple linear regression.

We'll do this by analyzing the relationship between temperature and ice cream sales.

Here we see a scatter plot of ice cream sales and temperature. Each dot represents the sales for a particular day as well as the temperature in degrees Celsius on that day.

Drawing a scatter plot like this should be the first step in conducting a linear regression.

We can see that there appears to be a positive relationship with sales increasing as temperature increases.

When conducting regression analysis, we generally identify one variable with a value of interest to us.

This is called the dependent variable or the outcome variable.

In this case, ice cream sales is the dependent variable.

We then identify a series of variables that we think might influence the value of the dependent variable.

These are called independent variables or predictor variables.

Here we have only one predictor variable, temperature.

We think the temperature can be used to predict sales. In linear regression, we assume that a linear equation can explain the relationship between the outcome variable and the predictor variables. In simple linear regression, we have only one predictor variable.

When we have multiple predictor variables, we use multiple linear regression, which we'll see in the next lesson. A regression analysis determines the line of best fit through these data points.

Let's see what this looks like.

This line is called the regression line.

It predicts the ice cream sales for the range of temperatures found in our data set.

Let's briefly consider how this line is calculated.

The most common method for creating a regression line like this is called ordinary least squares, or OLS.

OLS fits the regression line that minimizes the sum of the squared vertical distances between each data point and the line.
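
As a rough illustration of what the software does, the following minimal sketch fits a one-predictor least-squares line using the standard closed-form formulas. The data values and the function names (`fit_simple_ols`, `sum_squared_residuals`) are hypothetical and are not the lesson's data set.

```python
# Hypothetical data for illustration only (not the lesson's data set).
temperatures = [14.0, 16.0, 20.0, 23.0, 26.0, 30.0]        # degrees Celsius
sales = [2000.0, 2350.0, 2800.0, 3300.0, 3650.0, 4300.0]   # ice creams sold


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = sum of cross-deviations divided by sum of squared x-deviations.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    # The least-squares line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_residuals(x, y, intercept, slope):
    """Sum of squared vertical distances between the points and the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


intercept, slope = fit_simple_ols(temperatures, sales)
best = sum_squared_residuals(temperatures, sales, intercept, slope)
print(f"fitted line: sales = {intercept:.1f} + {slope:.1f} * temperature")
print(f"sum of squared residuals: {best:.1f}")
```

Any other choice of intercept or slope gives a larger sum of squared residuals on the same data, which is what "best fit" means under OLS.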

In practice, you'll never create a regression line manually, so you don't need to worry too much about how the line is fitted.

Instead, a regression line like this will generally be created through a software program.

We'll just look at the results of this process.

When we fit a regression line like this, we'll be given the equation of the line which may look something like this.

More generally, we can express the equation of a linear regression like this.

The outcome variable is represented by the letter Y.

The predictor variables are represented by the letter X.

In our case, we only have one predictor variable, which is denoted by X_1. However, we can have many predictors, which would be denoted by X_2, X_3, and so on.

The term beta zero refers to the intercept.

This is the point where the fitted line crosses the y-axis.

In our example, this is 50.

This represents the number of ice creams we would expect to sell if the temperature were zero degrees.

The beta one term refers to the slope of the regression line.

In our example, this is 140.

This is the change in the number of ice creams sold if the temperature increases by one degree.

Finally, we have the error term denoted by epsilon.

As we saw in the chart, the points often don't fall exactly on the fitted line.

In a linear regression, we assume that this is due to random deviations from the model represented by the error term.

Because the error is random, we can't know what value it will have for a particular observation.

Therefore, we assume it equals zero when making predictions.
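
The on-screen equations are not reproduced in this transcript. Assuming the standard form implied by the terms described above, the simple linear regression model and the fitted line for the ice cream example (intercept 50, slope 140, error term set to zero for prediction) can be written as:

```latex
% General form with one predictor
Y = \beta_0 + \beta_1 X_1 + \varepsilon

% Fitted line for the ice cream example, with the error term set to zero
\widehat{\text{Sales}} = 50 + 140 \times \text{Temperature}
```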

Once we know the equation of a linear regression, we can use it to make predictions for the outcome variable.

In this case, when we have temperature data for a particular day, we simply substitute the relevant temperature into the equation to find the amount of ice cream we can expect to sell on that day.
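
A minimal sketch of this prediction step, using the intercept (50) and slope (140) quoted in the lesson. The function name `predict_sales` and the forecast temperatures are hypothetical.

```python
INTERCEPT = 50.0   # beta zero in the lesson's example
SLOPE = 140.0      # beta one in the lesson's example


def predict_sales(temperature_c):
    """Predicted ice cream sales, with the error term assumed to be zero."""
    return INTERCEPT + SLOPE * temperature_c


# Hypothetical forecast temperatures in degrees Celsius.
forecast = [18.0, 22.5, 27.0]
for temperature in forecast:
    print(f"{temperature:.1f} degrees: about {predict_sales(temperature):.0f} ice creams")
```

Predictions are most trustworthy for temperatures inside the range covered by the original data.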

With this information, an ice cream vendor could use temperature data from the weather forecast to predict how much ice cream they will sell in the future. A linear regression assumes that the outcome variable can be modeled as a linear function of the predictor variables.

However, this won't always be the case. For example, consider these charts. These four data sets are known as Anscombe's Quartet.

The same regression line can be fitted to all four of the data sets but it's only a sensible line in the case of the top left panel. In the top right panel, the real relationship appears to follow a curve. So, fitting a line is not appropriate. In the bottom left panel, an outlier causes the wrong line to be fitted.

In the bottom right, an outlier gives the impression of a linear relationship when such a relationship doesn't seem to really exist.
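
The point about identical fitted lines can be checked with a short sketch. It uses the data values published for Anscombe's Quartet and a hand-written least-squares helper (`fit_line`, a hypothetical name). The panel labels assume the lesson's chart shows the four data sets in their usual order.

```python
# Anscombe's Quartet, values as published by Anscombe (1973).
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "top left": (x_shared,
                 [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "top right": (x_shared,
                  [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "bottom left": (x_shared,
                    [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "bottom right": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
                     [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


fits = {panel: fit_line(x, y) for panel, (x, y) in quartet.items()}
for panel, (intercept, slope) in fits.items():
    # Each panel prints the same line to two decimals: y = 3.00 + 0.50 * x
    print(f"{panel:>12}: y = {intercept:.2f} + {slope:.2f} * x")
```

The fitted equations match to two decimal places even though the scatter plots look completely different, which is why plotting the data first matters.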

In three of these four cases, a linear regression would be a bad choice of model.

Finally, let's evaluate how well the regression line fits the data by returning to our ice cream example. The regression line seems to be a pretty close fit, but how can we measure just how close a fit it is? We can do this using the coefficient of determination, also known as R-Squared. R-Squared measures how much of the variation in the outcome variable can be explained by the predictor variables. It's expressed as a proportion from zero to one. We won't discuss the details of how to calculate R-Squared, as it will be calculated for you whenever you run a regression in a statistical software package.

For our ice cream data, the R-Squared value is approximately 0.94.

This is very high, and suggests 94% of the variation in ice cream sales can be explained by variations in the temperature.

This suggests that the linear regression was a good model for this particular data.
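
The lesson leaves the R-Squared calculation to software. For the curious, here is a minimal sketch of the usual definition: one minus the ratio of unexplained variation to total variation. The function name `r_squared` and the small data set are hypothetical and do not reproduce the lesson's 0.94 figure.

```python
def r_squared(x, y):
    """R-Squared of the least-squares line fitted to one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Variation left unexplained by the fitted line.
    ss_residual = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    # Total variation of the outcome around its mean.
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_residual / ss_total


# Hypothetical data: a noisy upward trend.
x_values = [1.0, 2.0, 3.0, 4.0]
y_values = [1.0, 3.0, 2.0, 4.0]
print(f"R-Squared: {r_squared(x_values, y_values):.2f}")
```

A value near one means the line accounts for most of the variation in the outcome; a value near zero means it accounts for very little.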

In this lesson, we've introduced the concept of regression analysis and linear regression.

The example we've seen here is a case of simple linear regression where we have only one predictor variable.

However, it's more common to have several predictors.

In the next lesson, we'll see how to deal with several predictors by using multiple linear regression.