12. Comparing Model Accuracy

Overview

Learn how to combine the outputs of multiple predictive models, compare their accuracy, and rank the models accordingly.

Lesson Notes

Organizing the Canvas

  • It is good practice to put different sections of the workflow in containers, so we can easily see which tools are related to each other
  • Making connections wireless removes the messy visual of wires that connect various tools in your workflow

Comparing models

  • A Union tool is used to combine and compare the aggregated data from each model
  • The data is converted to percentages and sorted by accuracy and false positive percentage

Transcript

Over the previous lessons we've deployed seven predictive models to our grant application data set. We've scored each of these models and converted the resulting output into confusion matrix data, which will let us make an impartial decision about which models to prefer. In this lesson, our goal is to compare our models and determine the three top performers. In later lessons, we'll apply these models to a new data set and see the results. To accomplish our goal here, we'll follow three key steps. First, we'll reorganize our canvas so that the different branches are easier to follow. Next, we'll combine the outputs from our models and reformat the data so that the models are easier to compare.

Finally, we'll sort our models by accuracy and compare the results. For the first step in this lesson, let's tidy up the canvas. As we can see, our canvas is getting quite cluttered and difficult to read. To clean it up, we'll make some of our connections wireless. Since our models are in their own containers, the logical step here is to make all incoming connections wireless. Because all of these connections come from the Create Samples tool, this is relatively easy to do.

We'll right-click on the Create Samples tool and select Make Outgoing Connections Wireless. The physical connections are now replaced by a wireless symbol for each of the affected tools. If we want to see the connections to or from a specific tool, we simply click on that tool and those connections show up.

Now that our canvas is a bit less cluttered, we're ready to combine our outputs and reformat the data.

We'll begin by bringing a Union tool onto the canvas and connecting it to the Summarize tool from each model.

In our effort to keep things tidy, we'll right-click on the Union tool and make the incoming connections wireless. We'll now run the workflow to see our combined data. Just as in previous lessons, this may take some time to process, so I'll cut out the wait time in this video. We can see that our seven models are presented in the Results window together with their confusion matrix data.
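To make the Union step concrete, here is a minimal pandas (Python) sketch of the same idea. The model labels, column names, and counts are hypothetical stand-ins for the summarized confusion matrix fields in the workflow, not the actual field names or values.

```python
import pandas as pd

# Hypothetical summaries for two of the seven models; in the workflow, each
# model's Summarize tool emits one row like this (the values are made up).
forest = pd.DataFrame([{"Model": "Forest", "True Negative": 4100, "False Negative": 250,
                        "True Positive": 400, "False Positive": 250, "Count": 5000}])
tree_2 = pd.DataFrame([{"Model": "Decision Tree 2", "True Negative": 4050, "False Negative": 300,
                        "True Positive": 380, "False Positive": 270, "Count": 5000}])

# The Union tool stacks the rows coming from each model into a single table.
combined = pd.concat([forest, tree_2], ignore_index=True)
print(combined)
```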

These numbers will be easier to understand as percentages, so we'll add a Formula tool to create those fields.

We'll name the first field True Negative Percentage...

And enter the formula True Negative divided by Count.

We'll change the data type to Fixed Decimal size 12.2 as we only need two decimal places. We'll now add a new canvas... Name the field False Negative Percentage...

And enter the formula False Negative divided by Count.

Again, we'll change the data type to Fixed Decimal size 12.2.

We'll now add another canvas, name this field True Positive Percentage...

And enter the formula True Positive divided by Count.

Again, we'll change the data type to Fixed Decimal size 12.2.

We'll now add a fourth canvas... Name this field False Positive Percentage...

Enter the formula False Positive divided by Count...

And again change the data type to Fixed Decimal size 12.2.

Finally, we'll create one last canvas and name the field Accuracy.

This is simply True Positive Percentage plus True Negative Percentage.

We'll enter the formula in the canvas... And again change the data type to Fixed Decimal size 12.2.
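For readers who prefer to see the five expressions side by side, here is a rough pandas sketch of what the Formula tool computes, continuing from the hypothetical combined table above. The field names follow the transcript; the only liberty taken is rounding in code rather than via the Fixed Decimal data type.

```python
# Each confusion matrix count becomes a share of the total record count,
# kept to two decimal places as in the lesson.
pct = combined.copy()
for field in ["True Negative", "False Negative", "True Positive", "False Positive"]:
    pct[field + " Percentage"] = (pct[field] / pct["Count"]).round(2)

# Accuracy is simply the share of correct predictions:
# True Positive Percentage plus True Negative Percentage.
pct["Accuracy"] = pct["True Positive Percentage"] + pct["True Negative Percentage"]
print(pct[["Model", "Accuracy", "False Positive Percentage"]])
```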

At this point, we'll move on to the final step: sorting our models based on how accurate they are and comparing the results. To that end, we'll connect a Sort tool to the Formula tool.

We'll then sort the models by accuracy in descending order.

We're only calculating our percentages to two decimal places, so there may be some instances where two models have the same accuracy.

In these cases, we need to choose an additional selection criterion.

We want to minimize false positives, so for our second sort criterion, we'll choose False Positive Percentage in ascending order. We'll run the workflow one last time to see how the models compare.

Again, this will take some time to process, so I'll cut out the wait time in this video.
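While the workflow runs, here is the equivalent of the two sort criteria in the same hypothetical pandas terms used above: accuracy descending, with False Positive Percentage ascending as the tie-breaker.

```python
# Rank the models: highest accuracy first, ties broken by the lowest
# false positive percentage.
ranked = pct.sort_values(
    by=["Accuracy", "False Positive Percentage"],
    ascending=[False, True],
).reset_index(drop=True)

# The three top performers carry forward to the next lessons.
top_three = ranked.head(3)
print(top_three[["Model", "Accuracy", "False Positive Percentage"]])
```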

We can see that our seven models are sorted, with the forest model coming out on top, while the second and third decision tree models were the next best performers. Before we end this lesson, let's quickly recap what we've accomplished. First, we made some of the connections wireless so that it's easier to follow the different branches of the workflow. Next, we combined our model outputs and reformatted the data so that the models are easier to compare.

Finally, we sorted our models by accuracy and compared the results. In the next lesson, we'll take our top three models and prepare to deploy them across a new data set.