5. ARIMA Analysis

 

Overview

Our in-depth examination of time series analysis continues in this lesson. We will run our first time series model using the ARIMA tool and consider the results.

Lesson Notes

Create Samples

  • The Create Samples tool allows users to split the data into two sets
  • Users can then build models on the estimation set and validate them against the validation set, as sketched below.
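Outside Alteryx, the same split is a one-liner. A minimal pandas sketch, assuming a placeholder weekly sales series (the data and the name Weekly_Sales are invented for illustration):

```python
import pandas as pd

# Hypothetical weekly sales series standing in for the lesson's online
# marketing data aggregated to the weekly level.
sales = pd.Series(
    range(100, 204),
    index=pd.date_range("2014-01-05", periods=104, freq="W"),
    name="Weekly_Sales",
)

# Time series samples must be split chronologically, never randomly:
# the first 70% becomes the estimation set, the remainder the validation set.
cutoff = int(len(sales) * 0.7)
estimation, validation = sales.iloc[:cutoff], sales.iloc[cutoff:]
print(len(estimation), len(validation))  # 72 and 32 weeks
```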

ARIMA

  • This tool estimates a time series forecasting model using an autoregressive integrated moving average method
  • The output displays a set of forecast error measures; however, mean absolute percentage error (MAPE) and mean absolute scaled error (MASE) are the most relevant (see the sketch after these notes)
  • The closer these errors are to 0, the better.
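Both measures are straightforward to compute by hand. A minimal sketch using the standard definitions, with toy numbers invented purely for illustration (MASE scales the forecast error by the error of a naive one-step forecast on the training data, so values below 1 beat the naive forecast):

```python
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error, expressed in percent."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return np.mean(np.abs((actual - forecast) / actual)) * 100

def mase(actual, forecast, train):
    """Mean absolute scaled error: forecast MAE divided by the MAE of a
    naive one-step forecast on the training data."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    naive_mae = np.mean(np.abs(np.diff(np.asarray(train, float))))
    return np.mean(np.abs(actual - forecast)) / naive_mae

# Toy numbers purely for illustration.
train = [100, 102, 101, 105, 107]
actual, forecast = [110, 112], [108, 113]
print(round(mape(actual, forecast), 2))         # 1.36
print(round(mase(actual, forecast, train), 2))  # 0.67
```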

Transcript

In this lesson we'll continue our exploration of the Alteryx time series tools. So far we've connected to the online marketing dataset and aggregated that data to the weekly level. We then investigated our data using the TS Plot tool, noting that our data looks suitable for forecast modeling. Remember, our overall goal is to generate a sales forecast for the coming year based on our historic data. Let's quickly run through what we'll do over the next three lessons. First, we'll partition our data into an estimation set and a validation set to improve the validity of our modeling. We'll then use the estimation set to compare different time series models. Next, we'll cross-reference those models against the validation set to make sure that the results are consistent. Finally, we'll run the entire dataset through our preferred model, yielding our forecast results. In this lesson we'll run an ARIMA analysis on the estimation set of our data and analyze the results to see if the model is suitable for our forecast. We'll accomplish this task in three key steps.

First, we'll split our dataset into an estimation set and a validation set. Next, we'll run the estimation set through an ARIMA model. Finally, we'll consider the results of the ARIMA model to get an idea of whether or not it's appropriate for our forecast.

The first step is to split the dataset into an estimation set and a validation set just as we did in the statistics course.

This will allow us to test our model on a statistically significant set of data and then later test the results against the validation set. We'll split our data into these two sets with the Create Samples tool. We'll navigate to the Preparation tab on the Tools palette and connect the Create Samples tool to our workflow. In the configuration settings, notice the estimation set is referred to as the estimation sample. We'll set this to pick up 70% of the data, with the validation sample picking up the balance. Note that there are three outputs from the Create Samples tool. The E output node is for the estimation set. This is our training data, which we'll connect to our different time series model tools. Now that we've split our data into two sets, we can run the estimation set through an ARIMA tool. We'll start by navigating to the Time Series tab and connecting the ARIMA tool to our workflow. In the configuration window we'll give the model the name Sales_ARIMA, targeting weekly sales with a frequency of weekly.

Moving to the Other Options tab, we'll select the series starting period box, enter the year 2014, and then choose 52 periods for the forecast plot. Finally, we'll add all browses to the tool.
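For readers who want to reproduce the idea outside Alteryx, here is a rough statsmodels sketch under stated assumptions: the synthetic series stands in for our estimation set, and the (1, 1, 1) order is chosen by hand because, unlike the Alteryx tool, statsmodels does not select the orders automatically:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic stand-in for the weekly sales estimation set.
rng = np.random.default_rng(0)
estimation = pd.Series(
    100 + np.cumsum(rng.normal(1, 5, 72)),
    index=pd.date_range("2014-01-05", periods=72, freq="W"),
)

# The Alteryx tool chooses the (p, d, q) orders automatically; statsmodels
# asks us to supply them, so (1, 1, 1) here is an illustrative assumption.
model = ARIMA(estimation, order=(1, 1, 1)).fit()

# Forecast the next 52 weekly periods, matching the lesson's configuration.
print(model.forecast(steps=52).head())
```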

Note that there's a tab for Model Customization. The options in this tab allow you to manually adjust the model. While it may be worth investigating these options when you're conducting your own analysis, Alteryx's research indicates that the automated methods outperform manually specified models for all but the most experienced users. If your forecast output is an unsatisfyingly straight line, it typically means that either the time series is not long enough to reveal patterns or there are no systematic patterns in the data. In this case, the average of the series is the best possible estimate.
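We can illustrate that straight-line case directly. In the sketch below the input is pure noise, so the best the model can do is forecast the average of the series, exactly as described above (the ARIMA(0, 0, 0) order makes the collapse to the mean explicit):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Pure noise: no trend, no seasonality, no autocorrelation to exploit.
rng = np.random.default_rng(42)
noise = pd.Series(
    rng.normal(500, 20, 104),
    index=pd.date_range("2014-01-05", periods=104, freq="W"),
)

# With nothing systematic to model, the fit reduces to a constant, so every
# forecast step is the same flat value: essentially the series mean.
fit = ARIMA(noise, order=(0, 0, 0)).fit()
print(fit.forecast(steps=4).round(1))  # four identical values
print(round(noise.mean(), 1))          # the same number
```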

We'll now run the workflow.

At this point we can move on to the final step of the lesson and consider the output of the ARIMA model. The Browse tool connected to the O output node provides us with a summary of our data.

The Browse tool connected to the R, or report, output node shows a statistical summary of the ARIMA model, including autocorrelation function plots. Notice a number of forecast error measures, including mean error, root mean squared error, mean absolute error, mean percentage error, mean absolute percentage error, and mean absolute scaled error. In most cases you will not be concerned with the specific values presented here. Rather, you will use these numbers to compare different models. In this case your primary focus should be the mean absolute percentage error and the mean absolute scaled error, as these measures reflect the size of the forecast error in your model. The closer each of these numbers is to zero, the better.

The final Browse tool provides an interactive report. This report also includes a chart of the forecast together with confidence bands. We can see that the historic data is depicted with a single gray line, whereas the forecast is shown with a blue line. If we follow the blue line before the dashes, we can see how the forecast would have performed historically. While it's reassuring for these two lines to be closely correlated, be wary that this does not give you a false sense of security regarding your model's future accuracy. As the graph is not a straight line, it's fair to assume that the ARIMA model may be appropriate for our forecast.
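Those confidence bands can be reproduced outside Alteryx as well. A self-contained statsmodels sketch, re-using the illustrative model fitted earlier: get_forecast returns both the point forecast and the interval around it:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Re-fit the illustrative model from the earlier sketch.
rng = np.random.default_rng(0)
estimation = pd.Series(
    100 + np.cumsum(rng.normal(1, 5, 72)),
    index=pd.date_range("2014-01-05", periods=72, freq="W"),
)
model = ARIMA(estimation, order=(1, 1, 1)).fit()

# get_forecast returns the point forecast (the blue line in the report)
# plus the confidence bands drawn around it.
pred = model.get_forecast(steps=52)
print(pred.predicted_mean.head())        # point forecast
print(pred.conf_int(alpha=0.05).head())  # 95% confidence bands
```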

To recap, here's what we did in this lesson. First, we split our dataset into estimation and validation sets. We then ran the estimation set through the ARIMA tool.

Finally, we looked at the results of the ARIMA tool and, given the information available, determined that it may be appropriate for our forecast. At this point you may be wondering whether the ETS model will give us a better result. More importantly, will the integrity of our model hold up when we apply it to our validation dataset? We'll examine these subjects in our next lesson.