8. Calculating Confusion Matrix Values

 

Overview

In this lesson, you will learn how to use the Score tool to yield values for your target field and how to convert those values into confusion matrix summary data.

Lesson Notes

Supervised Learning

  • Supervised learning is a type of machine learning that uses a known dataset to make predictions
  • We are engaging in supervised learning with all the models that we use in this course

Developing Confusion Matrices

  • The Score tool is used to predict the Grant Status for each record, with 0 being negative and 1 being positive
  • We use a formula tool to calculate True Positive, True Negative, False Positive, and False Negative
  • Note that the model name formula must be formatted exactly as the name in the model’s configuration window – this is required for the model objects to work correctly in the final lesson
  • We use a Summarize tool to aggregate this information for the model as a whole

 

Transcript

In our look at predictive modeling tools so far, we have focused on decision trees, which are a form of supervised learning. Supervised learning broadly describes a type of machine learning algorithm that uses a known dataset to make predictions. In previous lessons, we engaged in supervised learning by using a sample of the university grant application data to train three different decision tree models. We then applied these models to previously unseen data, the validation set, to assess each model's prediction quality. The decision tree method attempts to divide the dataset until unique subsets of the target variable emerge. However, decision trees represent just one of a number of supervised learning tools available in Alteryx. Other predictive modeling techniques build on this decision tree idea. In the following lessons, we're going to take a closer look at some of these different models.

To that end, we want to develop a simple way to compare all these models. An obvious answer is to compare the predictive quality of these models by focusing on the confusion matrix. Each confusion matrix will give us easy-to-read metrics on how these models perform. To create these confusion matrices, we'll follow four key steps. First, we'll strip back unnecessary tools and deploy a score tool to the first decision tree. This will allow us to see predictions for each record. Next, we'll connect a formula tool and develop confusion matrix values for each record. Third, we'll use a summarize tool to aggregate the confusion matrix information for the model as a whole. Finally, we'll copy these tools and deploy them to the remaining two models. We'll also organize the models in individual containers to keep our workflow tidy.

Let's begin by stripping back our workflow to our three decision tree models and scoring our first decision tree. We'll bring down a score tool from the Predictive tab on the Tools palette and connect it to the first decision tree tool. We'll then connect the other input to the validation dataset from the create samples tool. We'll now run the workflow to see the predictions for each record.

If we scroll to the right, we can see that our predictive outcomes are presented with Score_0 representing the likelihood that no grant will be awarded and Score_1 representing the likelihood that a grant will be awarded. We can convert this score information into confusion matrix data by assigning these probabilities, record by record, to true positive, true negative, false positive and false negative values that we can sum later.
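As a concrete illustration, here is a minimal sketch in Python of how a single record's scores would feed the four confusion matrix values. The score and status values are made up for illustration, and the logic simply mirrors the formulas we build in the next step.

  # Hypothetical record: the model assigns Score_0 = 0.8 (no grant) and
  # Score_1 = 0.2 (grant), and the actual outcome is Grant.Status = 0.
  score_0, score_1, grant_status = 0.8, 0.2, 0

  # With an actual status of 0, the negative score counts toward True
  # Negative and the positive score toward False Positive.
  true_negative = score_0 if grant_status == 0 else 0.0    # 0.8
  false_positive = score_1 if grant_status == 0 else 0.0   # 0.2

  # Had the grant been awarded (Grant.Status = 1), the same scores would
  # count toward False Negative and True Positive instead.
  false_negative = score_0 if grant_status == 1 else 0.0   # 0.0
  true_positive = score_1 if grant_status == 1 else 0.0    # 0.0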

We're now ready to move on to step two and create these confusion matrix values for each record. To accomplish this we'll bring a formula tool onto the canvas and connect it to the score tool.

Before we start writing formulas we'll create a field that simply has the model name.

We'll name this field Model Name and write decision_tree_1 in quotes in the canvas.

We'll now create a new canvas and start with true negative. To that end we'll name the field True Negative.

This is the probability that the model correctly predicted that no grant would be awarded. To calculate this we'll enter the following conditional function.

IF ToNumber([Grant.Status]) = 0 THEN [Score_0] ELSE 0 ENDIF

We'll also change the data type to FixedDecimal and leave the size at 19.6 to stay on the safe side. This formula will return zero if a grant was awarded for the record and the score information for a negative prediction if the grant was not awarded. We can use the formula as a base for the next three formulas, so we'll copy it and create a new canvas.

We'll name this one False Negative and paste our conditional formula. In this case we want the score information for a negative prediction only if the grant was awarded. To that end we'll just change the grant status to one and the data type to FixedDecimal. We'll create another canvas, name it True Positive and paste the formula. In this case we want the score information for a positive prediction only if the grant was awarded. To that end we'll change the grant status to one, the score from zero to one and the data type to FixedDecimal. We'll now create the final canvas, name it False Positive and paste the formula again. In this case we want the score information for a positive prediction only if the grant was not awarded. This time we only need to change the score from zero to one and the data type to FixedDecimal.
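For reference, the same four conditional formulas can be sketched outside Alteryx in Python with pandas. This is only an illustration: the small scored dataset below is made up and stands in for the Score tool's output, with column names matching the workflow.

  import pandas as pd

  # Stand-in for the Score tool output: the actual Grant.Status plus the
  # model's Score_0 and Score_1 probabilities for each record.
  scored = pd.DataFrame({
      "Grant.Status": ["0", "1", "1", "0"],
      "Score_0": [0.8, 0.3, 0.1, 0.6],
      "Score_1": [0.2, 0.7, 0.9, 0.4],
  })

  status = pd.to_numeric(scored["Grant.Status"])
  scored["Model Name"] = "decision_tree_1"

  # Negative scores count as True Negative when no grant was awarded and
  # as False Negative when one was; positive scores count as True
  # Positive or False Positive in the same way.
  scored["True Negative"] = scored["Score_0"].where(status == 0, 0.0)
  scored["False Negative"] = scored["Score_0"].where(status == 1, 0.0)
  scored["True Positive"] = scored["Score_1"].where(status == 1, 0.0)
  scored["False Positive"] = scored["Score_1"].where(status == 0, 0.0)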

We're now ready to move on to step three and aggregate data from the record level to the entire model.

We can do this by bringing down a summarize tool and connecting it to the formula tool. We'll group the data by model name, sum the true negative, sum the false negative, sum the true positive, and sum the false positive. Finally, we need a count of the total number of records so that we can calculate percentages downstream. To do this we'll simply count the Grant.Status field.
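Continuing the sketch above, the Summarize step corresponds to a group-by aggregation in Python; again, the column names simply mirror the workflow.

  # scored is the DataFrame built in the previous sketch. Group by model
  # name, sum the four confusion matrix columns, and count Grant.Status
  # to get the number of records for downstream percentages.
  summary = scored.groupby("Model Name", as_index=False).agg({
      "True Negative": "sum",
      "False Negative": "sum",
      "True Positive": "sum",
      "False Positive": "sum",
      "Grant.Status": "count",
  })
  print(summary)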

At this point we're ready to move on to the final step: applying these tools to the other decision trees and putting the models in individual containers. We can easily apply these tools to the other models by copying them.

We'll paste the first set of tools, connecting the score tool to both the second decision tree and the validation set.

We'll also go to the formula tool and change the model name to decision_tree_2.

We'll now paste the tools again, this time connecting the score tool to the third decision tree and the validation set. Again, we'll go to the formula tool and rename this model decision_tree_3.

Let's now put these three models into their own containers starting with decision_tree_1. We'll select all four tools, right click and select Add to New Container.

We'll then name the new container Decision Tree 1 and move on to the next model. Again, we'll select all four tools, right click and select Add to New Container. As you might expect we'll name this container Decision Tree 2.

Finally, we'll put the third model in a container. Again, we'll select all four tools, right click, select Add to New Container and name the container Decision Tree 3.

Before we move on let's quickly recap what we did in this lesson. First, we stripped back unnecessary tools and deployed a score tool to the first decision tree. Next, we connected a formula tool and developed confusion matrix information for each record.

We then used the summarize tool to aggregate the confusion matrix information for the model as a whole. Finally, we copied these tools and deployed them to the remaining two models. In the next lesson we'll introduce a different type of predictive model to see if it produces better results.