3. Forming Sample Groups

 
Overview

In this lesson, we will partition the test candidates into sample groups in order to decide on the best treatment group.

Lesson Notes

Grouping Data

  • The Tile tool allows users to split the dataset into a specified number of groups of equal size

Transcript

In the previous lesson, we began the process of preparing the dataset of liquor stores for our AB trial. So far, we've eliminated stores with missing data from our analysis.

The goal in this lesson is to divide the remaining dataset into appropriate sample groups. We'll achieve this goal through three key steps. First, we'll import and format the high-level dataset to assist with group creation.

We'll then refine the dataset by removing data related to the outliers we identified in the previous lesson. Finally, we'll divide the stores into equally sized groups.

We want to compare the trial, or treatment, group with a similar group of stores operating under normal conditions, known as the control group. Therefore, it's important that the treatment group is a statistically fair representation of the entire business. This will allow us to assume that any significant change in the treatment group would likely impact the business as a whole. The AB Treatments tool allows us to compare different possible sample groups according to specified criteria and output the group that best fits the overall average. To use this tool, we must first split our data into sub-groups.
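To make the idea of "the group that best fits the overall average" concrete, here is a minimal Python sketch. It is not the AB Treatments tool's actual algorithm. The group labels, the sales figures and the single sales criterion are all made up for illustration.

```python
from statistics import mean

# Hypothetical candidate groups: group label -> sales figure per store.
candidate_groups = {
    1: [100, 120, 140],
    2: [90, 95, 100],
    3: [150, 160, 170],
}

# Overall average across every store in every candidate group.
all_sales = [s for sales in candidate_groups.values() for s in sales]
overall_mean = mean(all_sales)

# Pick the group whose own average is closest to the overall average.
best_group = min(
    candidate_groups,
    key=lambda g: abs(mean(candidate_groups[g]) - overall_mean),
)
print(best_group, overall_mean)
```

A real comparison would likely weigh several criteria at once; the single-criterion version here only shows the principle.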

We'll start by importing and formatting some relevant data to help with our analysis.

While the existing dataset contains sales and receipt information, further data points will help us distinguish between each of our stores.

We'll import the high-level store dataset to supply this extra information.

We'll then connect both an auto-field tool and a select tool to the new dataset and run the workflow.

We can see that this dataset contains details regarding the format and product range of each store, together with other information such as the distance to the nearest competitor store. Notice that the data type of the store field is an integer, so we'll change it to V_String as we did before. Our next step is to refine this list of stores so that it excludes the stores with limited sales data that we previously identified.

To accomplish this, we'll bring a join tool onto the canvas, connect the right input node to the high-level store dataset and the left input node to the true output node of the filter tool, and specify the join field as store.

We'll now run the workflow again.

The J output node now only contains the 104 stores that met our criteria in the container above.
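For readers who think in code, the join can be sketched in plain Python as an inner join on the store field. The rows and field names below are hypothetical. The sketch also shows why the store identifier was converted from an integer to text first: the keys on both sides must share one type to match.

```python
# Hypothetical stores that passed the earlier filter (IDs already text).
filtered_stores = [
    {"store": "101", "sales": 5200},
    {"store": "103", "sales": 4100},
]

# Hypothetical high-level store data; IDs arrive as integers.
store_details = [
    {"store": 101, "format": "A"},
    {"store": 102, "format": "B"},
    {"store": 103, "format": "A"},
]

# Convert the integer IDs to text so both sides share one key type.
details_by_store = {str(row["store"]): row for row in store_details}

# Inner join: keep only stores present on both sides.
joined = [
    {**details_by_store[row["store"]], **row}
    for row in filtered_stores
    if row["store"] in details_by_store
]
print([row["store"] for row in joined])
```

Store 102 drops out because it is missing from the filtered list, which mirrors how the J output keeps only the stores that met our criteria.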

We're now ready to move on to the next step and split this subset of stores into groups. We can use the tile tool from the preparation tab to help us with this task.

We'll bring the tile tool onto the canvas and connect it to the J output node of our join tool.

The Managing Director of Dan's Beverages has informed us that he would like to run the trial on a group of roughly 20 stores. We have 104 stores to choose from, so we'll navigate to the configuration window and specify that the stores should be split into five tiles of equal records.
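The arithmetic behind choosing five tiles can be checked with a short Python sketch. The helper name `tile_sizes` is hypothetical, and which tiles receive the leftover records is an assumption here, not a statement about how the tile tool distributes them.

```python
def tile_sizes(record_count: int, tile_count: int) -> list[int]:
    """Split record_count into tile_count sizes that differ by at most one."""
    base, extra = divmod(record_count, tile_count)
    # Assumption: the first `extra` tiles take one additional record each.
    return [base + 1 if i < extra else base for i in range(tile_count)]


sizes = tile_sizes(104, 5)
print(sizes)
```

Since 104 is not divisible by 5, the groups cannot be exactly equal: four hold 21 stores and one holds 20, which is close to the requested size of roughly 20.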

Let's use the high-level store dataset to further refine these groups.

One of the fields in this dataset is store format.

Dan's Beverages has a few different formats that are used in various locations based on several factors, such as shopper demographics and geography. We would like our sample group to include an equal share of these different types of store formats.

We can do this by specifying format in the grouping fields option of the configuration window. This will allow us to control for a host of different market factors in our AB test. We won't select any other fields for grouping, as they don't contain any information that will be helpful in determining like stores. We'll now run the workflow and see that two new fields have been created: tile number and tile sequence number.
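A rough Python analogue of tiling within a grouping field is sketched below. It assumes that records are tiled separately inside each format and that the sequence number restarts at 1 in every tile of every format. The function name `assign_tiles`, the field names and the ten sample stores are hypothetical, and the tile tool's exact ordering may differ.

```python
from collections import defaultdict


def assign_tiles(records, group_key, tile_count):
    """Add tile_number and tile_sequence_number within each group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record)

    result = []
    for members in groups.values():
        base, extra = divmod(len(members), tile_count)
        start = 0
        for tile in range(1, tile_count + 1):
            # Assumption: earlier tiles absorb any leftover records.
            size = base + 1 if tile <= extra else base
            chunk = members[start:start + size]
            for seq, record in enumerate(chunk, start=1):
                result.append({**record, "tile_number": tile,
                               "tile_sequence_number": seq})
            start += size
    return result


# Ten hypothetical stores alternating between two formats.
stores = [{"store": str(i), "format": "A" if i % 2 else "B"}
          for i in range(1, 11)]
tiled = assign_tiles(stores, "format", 2)
print(tiled[0])
```

Because each format is tiled on its own, every tile ends up with a share of each format, which is the effect we want from the grouping field.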

Tile number signifies the group for each record, while tile sequence number signifies a record's position within that group. Now that we've separated our stores into groups, let's stop the lesson here. In the next lesson, we'll use the available data to determine the preferred treatment group.