3. Multiple Linear Regression

 

Overview

Expanding on the last lesson, this lesson will explain the concept of multiple linear regression, which is more common in the real world than simple linear regression.

Summary

Lesson Goal

The goal of this lesson is to learn about multiple linear regression.

Multiple Linear Regression

Multiple linear regression is a regression model which contains multiple predictor variables. For example, with three predictor variables, it takes the form:

Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃

Each predictor variable has its own coefficient, which reflects the impact of that predictor variable on the outcome. As before, the model is a linear equation.
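To make this concrete, here is a minimal sketch of fitting a three-predictor model in Python with the statsmodels library; the data and coefficient values below are invented purely for illustration:

```python
import numpy as np
import statsmodels.api as sm

# Invented example data: 100 observations of three predictor variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))              # columns play the roles of X1, X2, X3
y = 5 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(size=100)

X = sm.add_constant(X)                     # adds a column of ones for the intercept (beta 0)
results = sm.OLS(y, X).fit()               # ordinary least squares fit
print(results.params)                      # estimated coefficients beta 0 through beta 3
```

Each entry of params is one estimated beta, giving the fitted version of the equation above.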

Analyzing Regression Outputs

We can’t draw a line chart of the equation when there are multiple predictor variables. Instead, we evaluate a multiple linear regression model by studying the coefficients of the regression equation. Any software that can fit a regression model will produce these coefficients.

The estimated coefficients allow us to create the regression equation for the model. As with simple linear regression, this lets us predict the outcome variable when we have values for each predictor variable. Regression outputs will also include a column of p-values. The p-value for a predictor variable tells us how likely it is that an effect of the size we observed could arise by chance alone if the predictor had no real effect. A low p-value therefore indicates that the predictor variable's effect is statistically significant.
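As a sketch of how these outputs are read in practice (refitting the same invented example as above), the coefficients, p-values, and predictions are all available on the fitted statsmodels results object:

```python
import numpy as np
import statsmodels.api as sm

# Refit the invented three-predictor example from the sketch above.
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 3)))
y = X @ np.array([5.0, 2.0, -1.5, 0.5]) + rng.normal(size=100)
results = sm.OLS(y, X).fit()

print(results.params)                            # estimated coefficients
print(results.pvalues)                           # one p-value per coefficient
print(results.predict([[1.0, 0.2, -0.5, 1.3]]))  # prediction; leading 1.0 matches the constant column
```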

Evaluating Regression Models

We evaluate multiple linear regression models using adjusted R². Like the R² statistic we saw in the last lesson, it generally takes a value between 0 and 1, and a higher value indicates a better-fitting model.
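For reference, adjusted R² is computed from R² by penalizing the number of predictors p relative to the number of observations n: adjusted R² = 1 − (1 − R²) × (n − 1) / (n − p − 1). Adding a predictor never lowers plain R², so adjusted R² rises only when a new predictor improves the fit by more than chance alone would explain.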

Transcript

In the previous lesson, we introduced regression analysis and looked at simple linear regression. In this lesson, we'll learn about multiple linear regression, which deals with situations where multiple predictor variables influence an outcome variable.

In the real world, simple linear regression is not common.

It's rare for an outcome of interest to be influenced by just one predictor variable.

It's more likely that an outcome will be influenced by several different predictor variables.

For example, in the previous lesson we saw that temperature can be used to predict ice cream sales. However, it's likely that many other variables, such as time of year, day of week, and price of ice cream, could also affect ice cream sales.

With multiple linear regression we can find out which predictor variables have a greater or lesser impact on an outcome variable. In the previous lesson, we saw the equation for a simple linear regression. Here, the outcome variable is denoted by Y and the predictor variables are denoted by X. Betas denote the coefficients of the model, which we explained in the last lesson.

For a multiple linear regression we simply add more predictor variables.

For example, with three predictor variables the equation would look like this: Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃.

Our three predictor variables are denoted X₁, X₂, and X₃.

They each have their own beta coefficient, which reflects the fact that each predictor will have a different influence on the outcome. As before, the model is expressed in the form of a linear equation, so we need to be confident that the underlying relationship between the outcome and each predictor variable is linear.

Because of the extra predictor variables, we cannot visualize the fitted regression as we did in the previous lesson. Instead, when analyzing a multiple linear regression, we focus on the coefficients of the regression equation. As before, we're not going to discuss the exact details of how the model is fitted; instead, we'll focus on how to interpret the results with an example.

Here, we see data for an online retailer. The table shows the output of a multiple linear regression fitted to their data.

The outcome variable is sales, and the table shows us three predictor variables that may influence sales. We're primarily interested in two columns in this table: the estimate column and the p-value column. The estimate column contains the various beta coefficients in the fitted regression model. The estimated intercept of 40.51 refers to the beta zero value, while the other estimates refer to the values of beta one, two, and three. As before, we can use these estimated coefficients to estimate sales for a future day if we know its values for the bounce rate, clicks, and paid search variables.
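As a rough sketch of that prediction step: the intercept of 40.51 comes from the table just described, but the remaining coefficients and the future day's values are hypothetical placeholders, since the full table isn't reproduced here.

```python
# Estimated coefficients: the intercept comes from the example table;
# the other three values are hypothetical placeholders.
beta_0 = 40.51    # intercept (beta zero, from the table)
beta_1 = -0.90    # bounce rate coefficient (hypothetical)
beta_2 = 0.05     # clicks coefficient (hypothetical)
beta_3 = 1.40     # paid search coefficient (hypothetical)

# Hypothetical values for a future day.
bounce_rate, clicks, paid_search = 35.0, 1200.0, 18.0

predicted_sales = beta_0 + beta_1 * bounce_rate + beta_2 * clicks + beta_3 * paid_search
print(predicted_sales)
```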

When we fit a multiple linear regression like this, we don't just want to know what the fitted equation is.

We also want to know whether the predictor variables actually influence the outcome variable.

We can do this by observing the p-value for each of the predictor variables.

It's possible that each predictor variable has no real effect on the outcome variable and that any effect we see in our data is just down to random chance.

The p-value, which ranges from zero to one, is the result of a statistical test that tells us how likely this is.

A low p-value implies that the predictor variable and outcome variable are related.

A higher p-value suggests the data provide no evidence of a relationship between the predictor and the outcome.

We generally consider a p-value below 0.05 to indicate that a predictor is related to an outcome.
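Formally, each p-value comes from testing the null hypothesis that the corresponding coefficient is zero: H₀: βᵢ = 0 against H₁: βᵢ ≠ 0. A p-value below 0.05 means that, if βᵢ really were zero, an effect at least as strong as the one observed would occur by chance less than 5% of the time.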

In our output, all the variables have very low p-values. In fact, the p-values are so small the computer program couldn't measure them precisely. This indicates that all three of these predictor variables have a significant impact on our outcome variable.

Finally, let's consider how to evaluate the usefulness of a multiple linear regression model. In the previous lesson, we learned about the coefficient of determination, known as R-squared.

In a multiple linear regression, we use a metric called adjusted R-squared.

This accounts for the presence of multiple predictor variables. Adjusted R-squared generally ranges from zero to one, although on rare occasions it can be negative. As with R-squared, higher values are better.

Our model has an adjusted R-squared of 0.933, indicating it's a good model for this data.
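For instance, in statsmodels this statistic is reported directly on the fitted results; a minimal sketch, reusing the invented example from the earlier sketches:

```python
import numpy as np
import statsmodels.api as sm

# Refit the invented example from the earlier sketches to show where the fit statistics live.
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 3)))
y = X @ np.array([5.0, 2.0, -1.5, 0.5]) + rng.normal(size=100)
results = sm.OLS(y, X).fit()

print(results.rsquared)       # plain R-squared
print(results.rsquared_adj)   # adjusted R-squared, penalized for extra predictors
```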

This concludes our look at multiple linear regression. When you run a regression, you'll usually have a selection of possible predictor variables that could influence an outcome. Multiple linear regression helps you identify which of these predictors are actually related to the outcome variable, and to what extent. As a result, it's a common tool in modeling data.

In the next lesson, we'll look at the final common type of regression: logistic regression.