1. Introduction to Predictive Modelling

Overview

This lesson lays the foundation for the predictive modelling course, setting out the business need and making some fundamental changes to the dataset to facilitate onward analysis.


Summary

  1. Understanding Classification Models (00:04)

    Classification is a form of supervised learning, meaning it trains on existing labeled data to build a model that can make predictions. Classification aims to identify which of several possible categories a new observation belongs to. As a result, the output of a classification model is categorical, or discrete. This is different from linear regression, which provides a continuous output. Logistic regression, however, provides a discrete, binary output, so it is technically a form of classification.

    Alteryx provides a range of classification techniques, including Decision Trees, Boosted Models, Forest Models, Neural Networks, and Naive Bayes.
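To make the contrast between continuous and discrete output concrete, here is a minimal Python sketch. It is not how Alteryx implements its tools: the coefficients are made-up illustrative values rather than a fitted model, and the function names and the "granted" / "not granted" labels are assumptions chosen for this example.

```python
import math


def linear_output(x, slope=2.0, intercept=1.0):
    """Regression-style output: any real number (continuous)."""
    return slope * x + intercept


def logistic_probability(x, weight=1.5, bias=-3.0):
    """The sigmoid function maps a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(weight * x + bias)))


def classify(x, threshold=0.5):
    """Classification-style output: one of two discrete labels."""
    if logistic_probability(x) >= threshold:
        return "granted"
    return "not granted"


if __name__ == "__main__":
    for x in (0.0, 1.0, 4.0):
        print(x, linear_output(x), round(logistic_probability(x), 3), classify(x))
```

The linear function can return any number, whereas the classifier only ever returns one of the two labels, which is the sense in which logistic regression counts as classification.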

  2. Overview of the Case Study (01:24)

    The dataset for this course relates to research grant applications at a university. We want to identify the factors that improve the quality of an application so that applications unlikely to succeed are flagged in advance.

  3. Reviewing the Data in Alteryx (02:07)

    The dataset contains information on almost 9,000 grant applications over a 4-year period. For each application, the data includes background information on the applicant, details of the size and type of grant applied for, and whether the application was granted.

    In our workflow, we have taken several steps to prepare the dataset before we start analyzing it. First, we use an Auto Field tool to assign a type to each field in the dataset. Next, we use a Select tool to manually adjust the type of certain fields where the Auto Field tool selected the wrong one. Third, we use a Formula tool to provide values for null data in various fields.
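The three preparation steps above can be sketched in plain Python as a rough analogue. This is only an illustration under stated assumptions: the field names, sample rows, the "Unknown" fill value, and the choice to override `sponsor_code` are all hypothetical, and the type inference shown is much simpler than what the Alteryx tools actually do.

```python
def infer_type(values):
    """Auto Field analogue: pick the narrowest type that fits every non-null value."""
    present = [v for v in values if v not in (None, "")]
    if not present:
        return "string"
    for caster, name in ((int, "int"), (float, "float")):
        try:
            for v in present:
                caster(v)
        except ValueError:
            continue  # at least one value does not fit; try the next type
        return name
    return "string"


# Hypothetical sample rows, with every value read in as text or null.
rows = [
    {"applicant_id": "101", "grant_value": "25000.50", "sponsor_code": "21", "grant_category": "A"},
    {"applicant_id": "102", "grant_value": "8000", "sponsor_code": "4", "grant_category": None},
]

# Step 1: infer a type for each field from its values.
inferred = {name: infer_type([row[name] for row in rows]) for name in rows[0]}

# Step 2 (Select analogue): manually override a wrong guess.
# A sponsor code looks numeric but is really a label, so treat it as a string.
overrides = {"sponsor_code": "string"}
field_types = {**inferred, **overrides}

# Step 3 (Formula analogue): replace nulls in string fields with a placeholder.
cleaned = [
    {
        name: ("Unknown" if value is None and field_types[name] == "string" else value)
        for name, value in row.items()
    }
    for row in rows
]

if __name__ == "__main__":
    print(field_types)
    print(cleaned[1])
```

Filling nulls with an explicit placeholder keeps those rows usable as a category of their own instead of leaving blanks that are harder to interpret later.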

Transcript

In the following series of lessons, we're going to take a deeper look at the predictive modeling tools available in Alteryx.

In a separate Alteryx course, we considered linear regression analysis.

This is considered a supervised learning technique because it trains on existing data to build a model which can make predictions with future data.

In this course, we'll explore another type of supervised learning, classification.

Whereas linear regression provides a numerical or continuous output, classification provides a categorical or discrete output.

Linear regression is not considered to be a classification model, but the same cannot be said of logistic regression.

Logistic regression is actually a form of classification in that it's used to predict a binary outcome, such as pass/fail, churn/not churn, and so on.

Alteryx provides a range of other classification techniques, including Decision Trees, Boosted Models, Forest Models, Neural Networks, and Naive Bayes.

Over the course of the following lessons, we'll investigate each of these techniques in turn.

We'll then consider how to compare and then deploy the preferred predictive model.

For these lessons, we're going to use a dataset of university grant applications.

Imagine we work for the grants department at a prominent university and review applications for research grants each year.

More than half of the grant applications fail. This is not only a waste of our time, but it also represents a significant amount of wasted effort on the part of the academic seeking funding.

Our goal in this course is to improve the grant application quality so that applications unlikely to succeed are flagged in advance.

This would not only reduce the burden on the grants department, but also help the university guide applicants so that less time is wasted on submissions unlikely to succeed.

We'll begin by reviewing the grant application dataset.

We can see that it contains information relating to almost 9,000 grant applications over a four-year period.

We have information regarding whether a grant was awarded or not together with background data about the applicant, the size and type of the grant requested, et cetera.

We're going to use the Alteryx predictive tools to consider this information and predict the likelihood that an application is awarded a grant.

With most data projects, you'll want to start by preparing your data.

Fortunately, our data has already been cleansed, so I'll quickly run through what these tools have accomplished.

First, the Auto Field tool was used to automatically select the data type for each field in our dataset.

We then used the Select tool to fix any instances where the Auto Field tool applied the wrong data type.

We then had to deal with null values.

If we go to the Browse tool and consider the Grant.Catagory.Code field, we can see that it has almost 1,000 null values. Indeed, many of the string fields in this dataset contain null values.

Leaving these fields blank might've made the results more difficult to interpret. To account for this, many conditional formulas were used to fill these blank fields.

I'm not going to run through each of these formulas right now, but it may be useful to go through them on your own time.

One final note: in many real-world situations, it may be useful to deploy tools from the Data Investigation tab to analyze the dataset for notable features or anomalies.

In this course, we'll only use the Field Summary tool, but it may be worth looking into some of the other tools on your own time.

Let's stop here.

In the next lesson, we'll run through an explanation of the Decision Tree tool, the first model we'll deploy in this course.
