9. Adding a Boosted Model

Overview

Building on your knowledge of the Alteryx predictive tools, this lesson shows you how to train a boosted model on your dataset.

Summary

  1. Lesson Goal (00:22)

    The goal of this lesson is to add the boosted model to our workflow so we can compare it to the decision tree models we created previously.

  2. Key Steps (00:28)
    1. Add boosted model tool to the workflow

    2. Calculate confusion matrix values

    3. Run the workflow and analyze the results

  3. Understanding Boosted Models (00:51)

    The boosted model, also known as gradient boosting, is a variant of decision trees. It aims to improve on the accuracy of a single decision tree by building a sequence of small trees, each one trained to correct the errors left by the trees before it. As with a decision tree, a boosted model can be prone to overfitting, so you should watch for this when creating one.
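
    The idea of boosting can be sketched in a few lines of Python. This is a minimal illustration for a numeric target with squared error, assuming one-split trees (stumps) and a hypothetical toy dataset; it is not the algorithm configuration used by the Alteryx Boosted Model tool. Each new stump is fitted to the errors left by the stumps before it.

```python
# Minimal gradient-boosting sketch (squared error, one-split trees).
# Illustrative only: the data, function names, and settings are assumptions.

def fit_stump(xs, residuals):
    """Find the single split on x that best fits the current residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep at least one point on the right
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted(xs, ys, n_trees, learning_rate):
    """Start from the mean, then add stumps fitted to the remaining errors."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate


def predict_boosted(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mean_squared_error(model, xs, ys):
    return sum((y - predict_boosted(model, x)) ** 2
               for x, y in zip(xs, ys)) / len(ys)


# Hypothetical toy data with three rough plateaus.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]

one_tree = fit_boosted(xs, ys, n_trees=1, learning_rate=1.0)
boosted = fit_boosted(xs, ys, n_trees=20, learning_rate=0.5)

mse_one_tree = mean_squared_error(one_tree, xs, ys)
mse_boosted = mean_squared_error(boosted, xs, ys)
print("single stump training error:", mse_one_tree)
print("boosted training error:", mse_boosted)
```

    The training error keeps falling as stumps are added, which is also why overfitting is a concern: a low error on the estimation data says little until it is confirmed on the validation data.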

  4. Step 1: Add Boosted Model Tool (01:40)

    We can create a boosted model using the Boosted Model tool. The input for this tool should be connected to our estimation dataset. To configure the tool, we specify the model name, the target field, and the predictor fields.

    As with a decision tree, knowledge of the dataset should inform your choice of predictor fields. You may wish to compare different combinations of fields, as we did when evaluating decision trees. If you want to compare a boosted model to other models, then you should select the same predictor variables for each model, to ensure the comparison is fair.
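
    One way to keep a comparison fair is to define the predictor list once and reuse it for every model. The sketch below assumes the field names read out in the transcript; the `other_model` entry and the helper function are hypothetical and are not part of the Alteryx workflow.

```python
# Shared target and predictor fields, defined once and reused by every model.
TARGET = "Grant Status"
PREDICTORS = [
    "Grant Category Code",
    "Contract Value Band",
    "With PHD1",
    "Number of Years in University at Time of Grant1",
    "Number of Successful Grant1",
    "Number of Unsuccessful Grant1",
]

model_configs = {
    "boosted_model": {"target": TARGET, "predictors": list(PREDICTORS)},
    # Hypothetical second model, added only to illustrate the check.
    "other_model": {"target": TARGET, "predictors": list(PREDICTORS)},
}


def uses_same_fields(configs):
    """Return True when every model shares one target and one predictor set."""
    signatures = {
        (cfg["target"], frozenset(cfg["predictors"]))
        for cfg in configs.values()
    }
    return len(signatures) == 1


comparison_is_fair = uses_same_fields(model_configs)
print(comparison_is_fair)
```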

  5. Step 2: Calculate Confusion Matrix (03:06)

    We can calculate the confusion matrix values using the same method that we used for a decision tree. In our case, we copy the Score, Formula, and Summarize tools that we used for a decision tree, then paste them onto the workflow. We attach the Score tool to the O (object) output of the Boosted Model tool and to the validation dataset. Reusing these tools saves us from reconfiguring them for each predictive model we create.
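
    The score, flag, and sum sequence can be sketched in Python as follows. This is a minimal illustration, not the Alteryx implementation: the scores, the actual outcomes, and the 0.5 cutoff are all assumptions made for the example.

```python
def confusion_counts(actual, scores, cutoff=0.5):
    """Count true/false positives and negatives for scored records.

    actual: 1 for a positive outcome, 0 for a negative one.
    scores: predicted probability of the positive outcome.
    cutoff: assumed classification threshold (hypothetical default).
    """
    counts = {"true_positive": 0, "false_positive": 0,
              "true_negative": 0, "false_negative": 0}
    for outcome, score in zip(actual, scores):
        predicted = 1 if score >= cutoff else 0
        if predicted == 1 and outcome == 1:
            counts["true_positive"] += 1
        elif predicted == 1 and outcome == 0:
            counts["false_positive"] += 1
        elif predicted == 0 and outcome == 0:
            counts["true_negative"] += 1
        else:
            counts["false_negative"] += 1
    return counts


# Hypothetical validation records: actual outcomes and model scores.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.8, 0.3]

counts = confusion_counts(actual, scores)
print(counts)
```

    Because the function takes only outcomes and scores, the same code works for any model, which mirrors the reason for copying the tools rather than rebuilding them.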

  6. Step 3: Analyze Results (03:59)

    In our case, the decision tree and the boosted model use the same size dataset. That means we can quickly compare both models by considering their true positive values. We can view this value for any of our models by selecting the Summarize tool at the end of the model. In our case, we find that the boosted model has more true positives than our original decision tree, but not as many as the larger decision trees with more variables.
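
    As a quick check of this comparison, the true positive counts quoted in the transcript can be ranked directly. The sketch below is only meaningful because every model was scored on a validation set of the same size; the dictionary labels are chosen here for illustration.

```python
# True positive counts as reported in the lesson transcript.
true_positives = {
    "Decision tree 1": 1015,
    "Decision tree 2": 1134,
    "Decision tree 3": 1128,
    "Boosted model": 1093,
}

# Rank models from most to fewest true positives.
ranking = sorted(true_positives.items(), key=lambda item: item[1], reverse=True)

boosted_count = true_positives["Boosted model"]
for name, count in ranking:
    gap = count - boosted_count
    print(f"{name}: {count} ({gap:+d} vs boosted model)")
```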

    When we create multiple predictive models, it’s good practice to put each model in its own container. We therefore do this for our boosted model and decision trees.

Transcript

In the previous lesson, we prepared our decision tree workflow so we could derive confusion matrix data as an output from our three models.

Now that we've set up our three decision tree models for further analysis, we would like to compare these results with other models available in the predictive tab.

Our goal in this lesson is to add the first of these models: the boosted model.

We'll accomplish this goal in three key steps.

First, we'll add a boosted model to our workflow and configure the tool.

We'll then calculate confusion matrix values as we did in the previous lesson.

Finally, we'll run the workflow and analyze the results to see how our models perform.

Before we jump into the lesson, let's take a moment to explain what a boosted model is. The boosted model, also known as gradient boosting, is based on decision tree models.

As we've seen previously, decision tree models are not perfect. Our best model predicted 85% of the results correctly.

However, even that result is a bit misleading as the decision tree only has a degree of certainty for each record.

The boosted model tries to account for these deficiencies and improve upon decision trees through some complex math at every decision point.

As with other complex models, you'll need to be wary of overfitting when applying a boosted model.

Let's now move on to our first step: connecting a boosted model and configuring it.

We'll navigate to the predictive tab on the tool palette, bring a boosted model onto the canvas, and connect it to the estimation dataset from the create samples tool.

We'll navigate to the configuration window, give this model the name boosted_model and select the target field Grant Status.

We now need to select the predictor variables for our analysis.

We'll choose Grant Category Code, Contract Value Band, With PHD1, Number of Years in University at Time of Grant1, Number of Successful Grant1, and Number of Unsuccessful Grant1.

We'll stick with these six variables for all new models going forward.

In the real world, you may wish to work with a different combination of predictor fields, as we did with the decision tree models.

As with many of the advanced Alteryx tools, knowledge of the dataset in question, as well as trial and error, should drive your judgment around which variables to choose.

At this point, we're ready to move on to step two and calculate the confusion matrix values.

As a shortcut, we can use the score tool, formula tool and summarize tool from one of the other models as a base.

We'll select those three tools and copy them.

We'll now paste the tools and connect the score tool to both the boosted model and the validation dataset.

We'll then navigate to the formula tool and change the model name formula to boosted_model.

We can now run the workflow.

Note that this calculation may take a little while, so I'll cut out the wait time.

Once the workflow finishes running, we'll be ready to move on to step three and analyze the confusion matrix values.

Since all of these models have the same size sample set, looking at the sum of true positives is a good option for making a quick comparison.

We'll click on the output node of the summarize tool and see that there are approximately 1093 true positives.

Let's compare this with our decision trees.

We'll click on the output for decision tree three and see that it predicted 1128 true positives.

Moving on to decision tree two, we have 1134 true positives.

Finally, decision tree one shows 1015 true positives.

Further analysis will be necessary, but the initial conclusion is that our boosted model is slightly more accurate than decision tree one, but it does not improve on the other two decision trees, which both found more true positives.

Before we end the lesson, we'll put the boosted model in its own container.

We'll select all four tools, right-click, and select Add To New Container.

We'll name this container Boosted Model.

Let's quickly recap what we did in this lesson.

First, we added a boosted model to the workflow and configured the tool with six predictor variables.

We then connected a score tool, formula tool, and summarize tool to calculate confusion matrix values.

Finally, we ran the workflow and analyzed the results.

In the next lesson, we'll deploy three new models.