2. Preparing Dataset for Grouping

Subtitles Enabled

Sign up for a free trial to access more free content.

Free trial


Learn how to prepare a typical dataset for an AB Testing experiment, ensuring that only the most appropriate test candidates put forward for consideration.

Lesson Notes

The Switch Formula

  • The Switch Formula searches for specific items in a field and returns a user-defined value for each item identified, essentially switching one value out for another


We've been hired to advise Ben's Beverages in an extensive redesign project. The managing director would like to trial the new design across a portion of the business' 123 stores and assess the results after a six week period. The goal in this lesson is to determine which stores to use in the trial. We'll achieve this goal by following four key steps. First, we'll import historic receipt data and perform any necessary formatting. Next, we'll aggregate our data to the weekly level. Third, we'll summarize our data to identify any patterns.

Finally, we'll filter any outliers out of the data set.

We'll begin by connecting the historic receipt data set.

A quick review of this data, shows that this data set contains daily sales and receipt information for each store.

There are also flags telling us whether a store was open or closed on a specific day and whether it was running a special promotion or not.

Let's format this data by navigating to the Preparations tab, connecting both an auto field and select tool, and running the workflow.

We can see that the fields have been allocated various data types.

However, Store has been allocated an integer type.

AB Tools require identifying fields to be in a string format, so we'll change this field to a V String and run the workflow again.

We're now ready to move on to step two and aggregate our data.

Note that the data is presented in a daily format, however, we would like to do our analysis on a weekly level. We'll aggregate this data to the weekly level by bringing a Formula tool onto the work flow.

We'll navigate to the Configuration window and create a new field called "Week Date".

I'll paste in a formula that will return Monday's date for each seven day period associated with the transaction date.

I'll run the workflow and quickly go through what this formula actually does.

I'll explain this formula, starting from the inner most function and work my way out.

First, we use the Date Time format function to convert our date information into weekday information. We then use a Switch formula to assign a number for each day of the week, with Monday at zero and Sunday at six.

Finally, we use a Date Time Ad formula to subtract that weekday number from the transaction date, returning the date of the associated Monday.

At this point, we'll move on to step three and summarize our data to identify any high level patterns. We'll bring a Summarize tool onto the workflow, and aggregate by store.

We'll then do a distinct count of date and calculate both the minimum and maximum of week date.

By summarizing these fields, we can ensure that all our stores have data from the same date range. This is important as we'll need to compare like items for our AB test.

We'll now run the workflow again, and see that we have the total number of days of information for each store, together with the beginning and end dates.

We can investigate the data further by going to the Data Investigation tab, bringing a Field Summary tool onto the workflow, selecting all the fields for our summary, adding a Browse to the I Output Node, and running the workflow again.

If we maximize the Browse window, we can see that all stores have the same start and end date. However, notice that 19 stores only have information for 222 days, while the remaining stores have information for 406 days.

Remember, we want to compare like stores, so we'll need to exclude these 19 stores from our analysis.

We'll now move onto our final step and filter out these outliers. Before we do that, we'll use another Summarize tool to indicate which stores have insufficient data, and to separate out our data investigation from data manipulation. We'll bring another Summarize tool onto the canvas to create a parallel workflow.

Again, we'll group our data by store, perform a distinct count of date, and run the workflow again.

We'll then navigate to the Preparation tab, bring down a Filter tool, and specify the Filter by count distinct date, equal to 406.

We'll then navigate to the Join tab on the tools palate, bring a join tool onto the canvas, connecting it to both the True Output Node of the Filter tool and the week date formula.

We'll then navigate to the Configuration window, join on store, and run the workflow.

The J Output Node now contains the original data set, excluding the 19 stores with incomplete data. Let's stop the lesson here. In the next lesson, we'll refine the list of stores down to a small group for our redevelopment trial.