9. Market Basket Affinity and Item Co-occurence

Subtitles Enabled

Sign up for a free trial to access more free content.

Free trial


Cut Price Supermarkets would like us to analyze their receipt data and visualize item co-occurrence. As a first step, we will deploy the MB Affinity tool to create a matrix of item co-occurrences.


MB Affinity Tool

  • The MB Affinity tool takes in transaction data and constructs a matrix that shows how many times any two items appear on the same receipt
  •  There are several different options for determine the affinity between the two items in question – for more information, follow this link


As mentioned in the previous lesson, Cut Price Super Markets would like us to perform a market basket analysis on the receipt data to determine how to better target specific customer segments. Specifically, they want to see how sales to male and female customers differ during the weekend and the weekdays. Note that Cut Price views the weekend as Friday through Sunday and weekdays as Monday through Thursday. Our goal in this lesson is to segment our data and create a matrix of item co-occurrences. We'll achieve this goal through three key steps.

First, we'll use formula and filter tools to segment our data by gender and day of the week. Next, we'll calculate the co-occurrence or likelihood that a receipt will contain any two item pairs.

As a final step, we'll develop the work streams for different gender and weekday combinations. Note that this data set has already been cleaned with several different data preparation tools. I won't go into detail here since we've done this many times before. But do feel free to look at the formatting changes if needed.

If you look at the preview window, we can see that this data contains gender and age group information for each customer. It also contains associated receipts with items broken down into three category levels, the number of items purchased, and total spend by line item. Cut Price Super Markets would like us to focus on category three level items for this analysis, as these items offer the most granular look at customer behavior.

Our first step is to segment our data based on gender and day of the week. Our data contains date information, but has not currently identified a day of the week for any particular transaction. We'll add these weekdays by using a switch formula. We'll connect a formula tool and create a new field called Day of Week.

I'll paste in the formula, and quickly run through what it does. The first section searches through the date field, and returns the abbreviated day of the week. The switch is then used to return the number for each day.

Zero for Monday, one for Tuesday, et cetera. Next, we'll bring down a filter tool.

And create a basic filter for gender equal to female.

You could also choose male here, as we'll create a branch for each gender. We'll then connect a new filter tool to the true node.

And choose to filter by day of week is less than four.

This will pull out receipts for Cut Price's weekdays. Mondays through Thursdays only. We'll now run the workflow.

We can see that we have subset of data for receipts from female customers shopping on weekdays.

We're now ready to move on to step two and calculate the co-occurrence, Or the likelihood that a receipt will contain any two specific items. Before we run our co-occurrence analysis, we should strip out any unnecessary fields, such as the higher level categories. To accomplish this, we'll connect a summarize tool and group by both receipt ID and category level three item.

We'll then run the workflow.

We'll now navigate to the predictive grouping tab on the tools pallet, and connect a market basket affinity tool. The MB affinity tool has a single input, focusing on two fields. The transaction field, in this case, receipt ID, and the field containing the item identifier, or category three.

We now have three measures to choose from. We'll choose co-occurrence, which will simply count the number of transactions where both items are present. The other measures are cosign distance and cosign similarity.

Both of these measures return numbers between zero and one to signify similarity for each pair.

We'll now add a browse tool and run the workflow.

In the results window, we are presented with a 377 by 377 matrix telling us how often different categories appear on the same receipt. For example, we can see that 25 receipts contain the line item, apples, pears, rhubarb, and the line item all purpose cleaner's wipes. The figure at the intersection of two of the same items, i.e. air fresheners and air fresheners, are instances where there was just a single line item on a receipt.

Let's copy these tools and create another work stream connected to the T node of the filter tool.

We'll then change the day filter here to focus on weekends or Friday through Sunday. We'll now copy these two work streams, paste them, and connect them to the F Node of the filter tool to see co-occurrence for male shoppers on weekdays and weekends.

We'll run the workflow again to develop matrices for each work stream.

Let's stop the lesson here. The information at the end of each of these work streams is useful in its own right and can help managers develop personalized marketing campaigns. However, the current format is a bit unmanageable. In the next lesson, we'll reorganize this data and prepare for visualization in Tableau.