6. Validating our Model

Overview

In this lesson, you will learn how to validate various trained models and contrast their performance using a lift chart.


Summary

  1. Lesson Goal (00:27)

    The goal of this lesson is to compare each model’s performance against the validation dataset using the lift chart tool.

  2. Key Steps (00:35)
    1. Calculate the grant award percentage rate for the original dataset

    2. Configure a lift chart tool using this information

    3. Run the workflow and analyze the results

  3. Step 1: Calculate Grant Award Percentage Rate (00:58)

In order to configure a lift chart, we need to know the average rate of grants awarded. We calculate this using two Formula tools and a Summarize tool.

    The first Formula tool is attached directly to the grant application dataset. In this tool, we convert the Grant Status field, which has a value of 0 or 1, to a number. We then use a Summarize tool to sum this field, which tells us the total number of grants awarded. We also count the number of values in this field to identify the total number of applications. Finally, we use another Formula tool to calculate the percentage of applications that receive a grant by dividing the number of grants by the number of applications. In our dataset, we see that the grant award rate is 45.8%.

  4. Step 2: Configure a Lift Chart (02:56)

    We use a lift chart to compare our decision tree models. First we combine the three models using a Union tool, which we connect to the O output node of each model. Then we create a lift chart using the Lift Chart tool, which we connect to the unioned decision trees and the validation dataset.

When configuring the tool, we specify the type of lift chart; in this case, a total cumulative response chart. We also specify the true response rate, which is the grant award rate we calculated earlier, and the desired target level, which is 1 in our case.

  5. Step 3: Analyze the Results (04:15)

    The output of the lift chart tool shows us a chart comparing the sample proportion and the percent of total response captured. The sample proportion represents the proportion of grant applications being analyzed, while the percent of total response captured represents the proportion of awarded grants that are correctly identified by the model. Each decision tree is represented by its own line on the chart. This information is also provided in the form of a table.

    The diagonal line represents the performance of a model predicting grant applications at random. The further a model’s line is above this line, the better it performs. The ideal model would capture a high percentage of the total response using only a low sample proportion. As a result, the best model is generally the one that is furthest above the diagonal line.

In our case, there is no clear best model, as the lines for each model overlap and cross each other several times. When this happens, it’s best to choose the simplest model, as the added complexity of including more variables does not appear to make a significant difference. Here, that means selecting the model with four predictor variables.

Transcript

In the previous lesson, we compared three different decision tree models to see if various combinations of the predictor variables would lead to greater accuracy.

We found that our second and third models both offered improved accuracy over the initial model.

However, these comparisons were against the data set we used to create each of the models.

In this lesson, we're going to compare each model's performance against the validation data set using the lift chart tool. We'll accomplish this goal in three key steps.

First, we'll go back to the original data set and calculate the grant award percentage rate.

Next we'll configure our lift chart tool using that information.

Finally, we'll run the workflow and analyze the results.

In order to compare each of the models, we'll use a lift chart tool.

However, to configure this lift chart tool correctly, we must first calculate the average rate of grants awarded, that is to say the total number of grants divided by the total number of applications.

To calculate this metric, we'll bring our formula tool onto the canvas and connect it to the grant application data set.

We'll create a new field called grant status number, enter a ToNumber formula referencing the grant status field, and change the data type to Int16.

Next we'll connect the summarize tool to the formula tool.

We'll set the tool to sum the grant status number field and rename it to number of grants.

We'll also count the total number of applications by referencing the same field and renaming this field applications.

We'll now connect a second formula tool to calculate the average grant rate.

We'll create a new column called grant rate and simply divide the number of grants by applications.

We'll change the data type to fixed decimal size 12.3 as we would like three decimal places in our result.

We'll now run the workflow.

As we saw previously, this could take some time so I'll cut the wait time out of the video.

We'll look at the results in the formula tool and see that the grant rate of the entire data set is 0.458 or 45.8%.

We'll use this figure in the lift chart tool.
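
For reference, the same grant rate calculation can be reproduced outside the workflow in a few lines of Python. The sketch below uses pandas and assumes the applications sit in a CSV file with a 0/1 grant status column; the file and column names are illustrative, not taken from the lesson files.

import pandas as pd

# Load the grant application data (file and column names are assumptions).
apps = pd.read_csv("grant_applications.csv")

# Mirror the first formula tool: convert the grant status field to a number.
apps["grant_status_number"] = pd.to_numeric(apps["grant_status"])

# Mirror the summarize tool: sum the field and count the records.
number_of_grants = apps["grant_status_number"].sum()
applications = apps["grant_status_number"].count()

# Mirror the second formula tool: grants divided by applications.
grant_rate = number_of_grants / applications
print(round(grant_rate, 3))  # 0.458 for this data set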

At this point, we're ready to move on to step two and use our lift chart tool to compare the models against the validation set.

Before we bring our lift chart tool onto the canvas, we'll need to bring the models together with a union tool.

We'll bring down a union tool and connect the output nodes from each of the three decision tree tools.

We're now ready to connect our lift chart.

We'll bring our lift chart tool onto the canvas and connect one of its input nodes to the output of the union tool.

The second input node will connect to the validation data set coming from the create samples tool.

In the configuration window of the lift chart tool, we have the option to create a total cumulative response chart or an incremental response chart.

We'll choose total cumulative response chart for now.

In the true response rate field, we'll enter the grant rate of 0.458 we calculated previously.

We'll also set the target level to one.

We'll now add a browse tool and run the workflow.

Once the workflow finishes running, we'll be ready to move on to the final step and analyze the results of the lift chart.

We'll click on the browse tool and expand the window.

We're presented with a chart contrasting the sample proportion with the total response captured.

The sample proportion on the X axis is the proportion of grant applications being analyzed.

The total response captured on the Y axis represents the proportion of grants awarded that are correctly identified by the model. The dark black line dividing the chart marks the grant rate for the initial data set.

We can see that a hundred percent of our applications equates to a hundred percent of grants awarded.

In this case, it would mean that for every thousand applications, there were 458 grants awarded.

The three colored lines above the grant rate line show the improvement over random chance achieved by each of our three models.

What we're looking for is a model which captures a high proportion of grants awarded from a smaller sample proportion.

Put a different way, we're looking for the model that most efficiently identifies which applications are awarded grants.
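
To make the idea behind the total cumulative response chart concrete, here is a rough Python sketch of how such a curve can be computed from a model's predictions. It assumes you have the true 0/1 grant outcomes and a predicted probability for each validation record; the function and variable names are illustrative and are not part of the Alteryx tool.

import numpy as np

def cumulative_response(y_true, y_score, deciles=10):
    # Rank records from highest to lowest predicted probability.
    order = np.argsort(-np.asarray(y_score))
    y_sorted = np.asarray(y_true)[order]
    total_positives = y_sorted.sum()
    curve = []
    for d in range(1, deciles + 1):
        cutoff = int(len(y_sorted) * d / deciles)  # top d tenths of the sample
        captured = y_sorted[:cutoff].sum() / total_positives
        curve.append((d / deciles, captured))  # (sample proportion, response captured)
    return curve

# A random model captures roughly 0.4 of awarded grants in the top 0.4 of the
# sample; a useful model captures noticeably more, such as the 0.645 seen here.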

If we look at the fourth decile or 0.4 on the x-axis, all three models capture just under 65% of responses.

If we scroll down, we can see a table that lays out this data in numerical format.

Again, we can see that at the fourth decile, all three decision trees capture 64.5% of awarded grants.

Notice that at the second and third deciles, decision tree one has the upper hand.

Faced with a decision between these three models, which one do we prefer? Decision tree one requires four predictor variables versus 11 for decision tree two and 22 for decision tree three.

As a general rule, simpler models are better.

Further, decision tree one seems to perform better on smaller sample sizes than the other decision trees.

Based on this initial analysis, we'd prefer decision tree one.

However, it's important to remember that decision tree one is the least accurate of the three models.

This is something to consider as we continue our analysis.

Let's stop the lesson here.

To recap, we compared the performance of our three models against the validation set, using a lift chart tool.

We achieved this in three key steps.

First, we calculated the grant award rate so that we could apply it to the lift chart tool.

We then configured the lift chart tool to compare our models against the validation set.

Finally, we ran the workflow and analyzed the results.

We've spent the last several lessons looking at decision tree tools.

However, other predictive models may provide even better results.

Before we look into those other models, we'll take a deep dive into the confusion matrix in the next lesson.