3. Configuring Tree Models

 

Overview

In this lesson, you will learn the foundational steps to take when deploying your first decision tree model.

Lesson Notes

Create Samples Tool

  • The Create Samples tool splits the dataset into different sets, notably an estimation set and a validation set
  • This allows users to develop a model on the estimation set, and then validate the model on the validation set

Decision Tree Tool

  • The Decision Tree tool is configured with a target variable and predictor variables
  • The target variable is the variable that you are trying to predict
  • The predictor variables are key variables that influence the target variable
  • Information from Data Investigation tools, as well as knowledge of the dataset in question, can help determine what predictor variables to choose

Transcript

In the previous lesson, we discussed the concept of a decision tree at a high level. We explained that in predictive analysis, a decision tree attempts to classify the data into groups which influence a target variable. In this lesson, our goal is to configure our first decision tree model. We'll achieve this goal in three key steps. First, we'll split our dataset into an estimation set and a validation set.

We'll then deploy and configure a Decision Tree tool. Finally, we'll select the predictor variables, which the tree will use to split the dataset into different branches.

In the time series and forecasting course, we used the Create Samples tool to train a model, and then verify its output. We'll apply a similar process here, and train our predictive models, using a sample of the Grant Application dataset. To that end, we'll navigate to the Preparation tab, and bring a Create Samples tool onto the canvas.

We'll set the estimation sample at 60 percent, and the validation sample at 40 percent. We're now ready for step two: introducing a Decision Tree tool to our workflow. We'll navigate to the Predictive tab on the Tools palette, and bring down a Decision Tree tool.
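The Create Samples tool handles this split for us inside Alteryx. For readers who want to reproduce the same 60/40 estimation/validation split in code, here is a minimal sketch using pandas and scikit-learn; the file name grant_applications.csv is an assumption for illustration.

```python
# Minimal sketch of the 60/40 estimation/validation split outside Alteryx.
# The file name is a placeholder for the Grant Application dataset.
import pandas as pd
from sklearn.model_selection import train_test_split

grants = pd.read_csv("grant_applications.csv")

# 60% estimation set, 40% validation set, mirroring the Create Samples settings.
estimation_set, validation_set = train_test_split(
    grants, test_size=0.4, random_state=42
)

print(len(estimation_set), len(validation_set))
```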

We must connect the estimation output node from the Create Samples tool to this Decision Tree tool. In the configuration window for the decision tree, we'll begin by giving our model the name Decision_Tree_1.

We must now select the target variable, that is, the variable we want to predict. We'll select Grant_Status.

Let's take a moment to understand exactly what question we're asking of our model. Our data is binary: grant applications were either awarded a grant, or not. We therefore want the decision tree to look for features in our dataset that predict whether a grant was awarded. In the statistical analysis course, we learned that linear regression is better suited to explaining continuous variables, like sales data, whereas logistic regression is more appropriate for binary data. Decision trees can be applied to both of these data types. The former case is known as a regression tree, and the latter is called a classification tree. The tool handles this distinction automatically.

The next step in this lesson is to select the predictor variables, that is, the variables which the tree will use to split the dataset into different branches. There are two approaches we can take here. We can start with everything, and then pare our data back, or alternatively, we can start with only a few predictor variables and then layer further variables on top, to see if the resulting model is significantly improved. For this example, we're going to take the minimalist approach, and start with few variables.
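The regression tree versus classification tree distinction described above can be made concrete in code. This is a minimal scikit-learn sketch, independent of the Alteryx tool, showing how the choice follows from the type of target variable.

```python
# Sketch: choosing the tree type by target variable.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Binary target such as Grant_Status -> classification tree (our case).
classification_tree = DecisionTreeClassifier(max_depth=5, random_state=42)

# Continuous target such as sales figures -> regression tree.
regression_tree = DecisionTreeRegressor(max_depth=5, random_state=42)
```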

In the real world, obtaining consistent data with multiple variables can be expensive, if it's even available. Therefore, when building predictive models, it's good to start with less. You can always look for more later. Let's review the output from the Field Summary tool to help us determine which variables to include in the decision tree. As we look through this report, we can see that Contract.Value.Band and Grant.Category.Code display an interesting spread of results.

It makes intuitive sense that how much people are asking for, and for what general purpose, should both be important inputs for grant decisions. We'll navigate back to the Decision Tree tool, and include both these fields in our set of predictor variables. We'll also add two other predictor variables: whether the lead figure behind the application has a PhD, and how many years they had been at the university at the time of the grant.
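For anyone following along in code rather than Alteryx, here is a hedged sketch of the same configuration, continuing from the estimation_set created earlier. Contract.Value.Band and Grant.Category.Code come from the lesson; Has_PhD and Years_at_University are hypothetical stand-ins, since the exact field names depend on the dataset.

```python
# Sketch: fitting a classification tree on a small set of predictors.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

predictors = [
    "Contract.Value.Band",
    "Grant.Category.Code",
    "Has_PhD",              # hypothetical column name
    "Years_at_University",  # hypothetical column name
]

# Categorical predictors must be encoded numerically for scikit-learn.
X = pd.get_dummies(estimation_set[predictors])
y = estimation_set["Grant_Status"]

decision_tree_1 = DecisionTreeClassifier(max_depth=5, random_state=42)
decision_tree_1.fit(X, y)
```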

This is where knowledge of the dataset, and good judgment, come into play. Without experience in this field, or with this organization, it may be difficult to determine the best predictor variables for our analysis. Let's stop here, and recap this lesson. First, we split our dataset into an estimation set and a validation set. As in previous courses, these two sets allow us to train our model on one set of data, and then verify it on a second set.

We then connected the Decision Tree tool, and configured it to predict the grant status of an application. Finally, we selected the predictor variables, which the tree will use to split the dataset into different branches. In the next lesson, we'll run this model and consider the output.