1. Understanding Linear Regression


In this lesson, you'll learn the basics of linear regression and how it can be applied.

To explore more Kubicle data literacy subjects, please refer to our full library.


  1. Regression in Alteryx (00:12)

    Alteryx can help us conduct linear and logistic regression. To do this, you need to download the Alteryx predictive analytics add-in, which is available from the Alteryx admin portal. Once the add-in is installed, you will see a Predictive tab on the tools palette.

  2. Lesson Goal (00:41)

    The goal of this lesson is to understand what linear regression is and why it can be useful.

  3. Course Case Study (00:48)

    A construction company uses different blends of concrete to achieve different properties. The company experiments with different material combinations, which costs time and money. They want to understand more precisely what outcomes will result from particular material combinations.

  4. Linear Regression Functions (01:20)

    A function takes in input values, transforms those values according to some specific parameters, and outputs the transformed value. For our construction company, the input values are a blend of different concrete ingredients, and the output is the compressive strength of the blend. The challenge is to identify the transformation that occurs within the function.

    Predictive models like linear regression analyze inputs and outputs and aim to identify the function that transforms the specified inputs into the specified outputs. Some complex models, like neural networks, are referred to as black boxes, as we cannot see how they transform their inputs. By contrast, models such as linear regression are completely transparent: we can see the exact function they learn, and apply it to data beyond the data used to create it.

    Linear regression examines a series of inputs and outputs, and produces a formula that aims to map the inputs to the outputs. We can then use this formula to predict outputs for future inputs.

  5. Linear Regression Plots (03:20)

    A scatter plot of inputs and outputs can be used to visualize the concept of linear regression. A perfect model would constantly bend and curve to hit every point. Linear regression models the relationship between the points using a single line. This line exactly hits few, if any, of the points, but it should be close enough for most points. This example illustrates that linear regression requires data to be in a roughly linear shape in order to be effective. This happens when our data is correlated.


In the next series of lessons, we'll look more closely at some of the analytical capabilities of Alteryx, focusing on linear and logistic regression.

Before proceeding further, you should ensure that you have downloaded the Alteryx predictive analytics add-in.

You can check this easily by looking for the Predictive tab on the tools palette.

If it's not there, you can download it from the Alteryx admin portal.

This add-in contains many pre-packaged tools to assist in the predictive analytics process.

Once you've installed the suite of predictive tools, you're ready to begin.

In this lesson, we'll understand what linear regression is and why it can be useful.

We'll start by looking at a particular case study which will demonstrate how linear regression can be used to create value.

We'll be working with a construction company that routinely uses different blends of concrete to achieve different properties such as flow and compressive strength.

The company experiments with lots of different material combinations, but mixing them takes time and money.

What they really need is a method to figure out these properties without having to mix them every time.

Suppose they could do this with a simple algebraic function.

Functions work by taking input values, transforming those inputs based on specific parameters, and outputting the transformed value.

Let's consider a function where the input values are a blend of different concrete ingredients, such as kilograms of water and cement, and the output value is the compressive strength of the blend, measured in megapascals.

This would be incredibly useful as it would allow us to find the quality of different blends very quickly and cheaply.

But how do we find the mysterious parameters that make up this function? Just like a function in algebra, if you know the input and the output, you can work out what the function might actually be.
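As a sketch of this idea, ordinary least squares can recover a candidate linear function from observed inputs and outputs. The blend quantities and strength values below are invented for illustration; they are not the lesson's actual dataset.

```python
import numpy as np

# Hypothetical observations: each row is (kg of water, kg of cement)
# for one batch, and y holds the measured compressive strength in MPa.
X = np.array([
    [160.0, 280.0],
    [170.0, 300.0],
    [150.0, 320.0],
    [180.0, 260.0],
    [165.0, 310.0],
])
y = np.array([32.0, 36.0, 40.0, 28.0, 38.0])

# Prepend a column of ones so the model can learn an intercept term.
A = np.column_stack([np.ones(len(X)), X])

# Least squares finds the parameters that minimize the squared error
# between the formula's outputs and the observed outputs.
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, w_water, w_cement = coeffs

# The recovered function can now predict the strength of an untested blend.
predicted_strength = np.array([1.0, 162.0, 305.0]) @ coeffs
```

Knowing the inputs and outputs is enough to estimate the parameters, which is exactly the "work out what the function might be" step described above.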

This is how a lot of predictive models work and is especially common for linear regression.

Complex models like neural networks learn functions so intricate that we can't fully interpret them.

These types of algorithms are often called black boxes because we can't see what's going on inside.

Regression on the other hand is totally transparent.

It provides you with a simple algorithm that you can even use to manually calculate your predictions.

It works by reading a dataset of input values with their associated output, and examining the relationship between the two.

It will then produce a formula that will try to replicate this pattern.

Although it can read any number of input values, it's typically limited to a single output value.

This creates a kind of one-size-fits-all solution because we can use this singular formula for a wide variety of future inputs.
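To illustrate this transparency, here is a minimal sketch of a fitted regression formula, assuming some illustrative coefficient values (not taken from the lesson's data). The whole model is just an intercept plus a weight per input, so you could compute any prediction by hand.

```python
# Illustrative fitted coefficients (assumed values for the sketch):
intercept = -5.0
coef_water = -0.05   # change in MPa per extra kg of water
coef_cement = 0.12   # change in MPa per extra kg of cement

def predict_strength(water_kg, cement_kg):
    """Transparent linear formula: intercept + weighted sum of inputs."""
    return intercept + coef_water * water_kg + coef_cement * cement_kg

# The same single formula serves any future blend:
strength = predict_strength(160, 300)  # -5.0 - 0.05*160 + 0.12*300
```

Because the formula is explicit, there is no black box: each coefficient states exactly how much one input contributes to the single output.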

The drawback of this one-size-fits-all approach is that we sacrifice some accuracy.

If we plot our inputs and outputs on a scatterplot, we can see why this might be the case.

A perfect model would be a line that bends and curves to flawlessly meet every data point.

Having a one-size-fits-all formula, which is the same for all inputs, means that our line can't turn at all.

It's just a straight line that shoots through our plot.

This is why it's called linear regression.

It misses most of our data points, but if it's a good model, it will be close enough most of the time.

However, as you can see, the accuracy of our model depends a lot on our data being naturally arranged in a roughly straight line.

This happens when our data is correlated.

In a future lesson, we'll see that correlated data is an important prerequisite for building a regression model.
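One common way to check this prerequisite is the Pearson correlation coefficient: values near +1 or -1 indicate a roughly linear relationship, which is when a straight regression line fits well. The single-input data below is invented and deliberately made perfectly linear for the sketch.

```python
import numpy as np

# Hypothetical single-input example: cement content vs. strength.
cement = np.array([260.0, 280.0, 300.0, 310.0, 320.0])
strength = np.array([28.0, 32.0, 36.0, 38.0, 40.0])

# Pearson correlation between input and output; here the points lie
# exactly on a line (strength = 0.2 * cement - 24), so r is 1.
r = np.corrcoef(cement, strength)[0, 1]
```

Real data will never be perfectly correlated, but the closer r is to ±1, the better a straight line can approximate it.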

Let's stop the lesson here.

In the next lesson, we'll look at the construction company's data, and see how linear regression might help with their situation.