Discovering Patterns in Transactional Data

As a data scientist, one of the most common tasks you’ll encounter is finding patterns and relationships within large datasets. One powerful technique for achieving this is Apriori analysis, which is particularly useful for market basket (shopping cart) analysis and identifying frequent item sets in transactional data.

Supermarket Market Basket Analysis
Let’s consider an example from a supermarket setting. Suppose we have a dataset of customer transactions in which each transaction represents a customer’s shopping basket containing various items. The goal of Apriori analysis in this context is to identify sets of items that are frequently purchased together by customers. This information can be valuable for product placement, cross-selling strategies, and promotional campaigns.

For instance, the Apriori algorithm might reveal that customers who purchase bread and butter also frequently purchase milk. The underlying frequent item set is {bread, butter, milk}, and the relationship can be expressed as the association rule {bread, butter} → {milk}. By setting appropriate minimum support and confidence thresholds, the algorithm can identify such frequent item sets and generate association rules like:
{bread, butter} → {milk} (support = 0.3, confidence = 0.8)

This rule suggests that 30% of transactions contain bread, butter, and milk, and 80% of customers who bought bread and butter also bought milk. Armed with these insights, the supermarket can strategically place milk near the bread and butter sections, run promotions bundling these items together, or recommend milk to customers who have bread and butter in their baskets.
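
To make these two numbers concrete, here is a quick sanity check in Python on ten hypothetical baskets. (The toy data below yields a confidence of 0.75 rather than the 0.8 in the example; the point is the calculation, not the exact figures.)

    # Ten hypothetical baskets; exactly 3 contain bread, butter, AND milk.
    transactions = [
        {"bread", "butter", "milk"},
        {"bread", "butter", "milk"},
        {"bread", "butter", "milk"},
        {"bread", "butter"},
        {"bread", "milk"},
        {"butter", "milk"},
        {"bread"},
        {"milk"},
        {"eggs", "milk"},
        {"eggs"},
    ]

    def count(items):
        # Number of baskets containing every item in `items`.
        return sum(items <= basket for basket in transactions)

    # Support: fraction of all baskets containing bread, butter, and milk.
    support = count({"bread", "butter", "milk"}) / len(transactions)
    # Confidence: of the baskets with bread and butter, the share that also have milk.
    confidence = count({"bread", "butter", "milk"}) / count({"bread", "butter"})
    print(support, confidence)  # 0.3 0.75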

The Apriori Algorithm
Apriori analysis is a data mining technique used to uncover interesting relationships or associations between variables in a dataset. It operates on the principle of frequent item set mining, which involves identifying sets of items that frequently appear together in a given dataset. The name “Apriori” comes from the fact that the algorithm uses prior knowledge of frequent item set properties to guide the search for larger item sets. In other words, it leverages the fact that if an item set is frequent, then all of its subsets must also be frequent; equivalently, any superset of an infrequent item set can be skipped, which is what lets the algorithm prune the search space.

The Apriori algorithm operates in two main steps:

Frequent Item Set Generation: In this step, the algorithm identifies all item sets that satisfy a minimum support threshold. Support is a measure of how frequently an item set appears in the dataset.

Rule Generation: After identifying the frequent item sets, the algorithm generates association rules that satisfy a minimum confidence threshold. Confidence is a measure of how likely it is for the consequent to occur given the antecedent.

The algorithm iteratively generates candidate item sets of increasing length, prunes those whose support falls below the threshold, and repeats until no more frequent item sets can be found; the surviving item sets are then used to generate rules and compute their confidence values.
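
To make these steps concrete, here is a minimal, from-scratch sketch of the frequent item set generation step in Python. It assumes transactions are given as Python sets and is written for readability rather than speed:

    from itertools import combinations

    def apriori_frequent_itemsets(transactions, min_support=0.3):
        # Level-wise search for frequent item sets (step 1 of Apriori).
        n = len(transactions)

        def support(itemset):
            return sum(itemset <= t for t in transactions) / n

        # Level 1: individual items that meet the support threshold.
        items = {item for t in transactions for item in t}
        levels = [{frozenset([i]) for i in items if support({i}) >= min_support}]

        k = 2
        while levels[-1]:
            prev = levels[-1]
            # Join frequent (k-1)-item sets to form candidate k-item sets...
            candidates = {a | b for a in prev for b in prev if len(a | b) == k}
            # ...and prune any candidate that has an infrequent (k-1)-subset
            # (the Apriori property), before counting support in the data.
            candidates = {
                c for c in candidates
                if all(frozenset(s) in prev for s in combinations(c, k - 1))
            }
            levels.append({c for c in candidates if support(c) >= min_support})
            k += 1

        return [s for level in levels for s in level]  # the last level is empty

Rule generation, the second step, then splits each frequent item set into an antecedent and a consequent and keeps only the rules whose confidence clears the chosen threshold.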

Applications of Apriori Analysis
Apriori analysis has a wide range of applications, particularly in the following domains:
  • Market Basket Analysis: Identifying products that are frequently purchased together, which can inform product placement, cross-selling strategies, and promotional campaigns.
  • Web Usage Mining: Analyzing patterns in website clickstreams to understand user behavior and optimize website design and content.
  • Bioinformatics: Identifying co-occurring genes, proteins, or other biological entities that may be related or involved in similar processes.
  • Intrusion Detection: Identifying patterns of system calls or network traffic that may indicate malicious activity or security breaches.

Getting Started with Apriori Analysis
To get started with Apriori analysis, you’ll need a dataset containing transactional data or item sets. Many programming languages and data mining libraries, such as R’s arules package or Python’s mlxtend, provide implementations of the Apriori algorithm. Once you have your dataset and library set up, you can specify the minimum support and confidence thresholds, run the Apriori algorithm, and analyze the resulting frequent item sets and association rules.

Apriori analysis is a powerful tool for uncovering hidden patterns and relationships in data, and it’s a valuable addition to any data scientist’s toolkit. With its wide range of applications and relatively straightforward implementation, it is an excellent technique to explore and master.
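
As a concrete illustration of that workflow, here is a short sketch using Python’s mlxtend on a handful of made-up baskets (the thresholds are arbitrary and exist only for the example):

    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    from mlxtend.frequent_patterns import apriori, association_rules

    # Toy data: each inner list is one customer's basket.
    baskets = [
        ["bread", "butter", "milk"],
        ["bread", "butter"],
        ["bread", "milk"],
        ["butter", "milk"],
        ["bread", "butter", "milk"],
    ]

    # One-hot encode the baskets into a boolean DataFrame, one column per item.
    te = TransactionEncoder()
    onehot = pd.DataFrame(te.fit(baskets).transform(baskets), columns=te.columns_)

    # Step 1: frequent item set generation at the chosen support threshold.
    frequent = apriori(onehot, min_support=0.4, use_colnames=True)

    # Step 2: rule generation at the chosen confidence threshold.
    rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
    print(rules[["antecedents", "consequents", "support", "confidence"]])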

Why students need to understand and work with data

  • Develops critical thinking and analytical skills – Analyzing data requires students to ask questions, identify patterns, draw conclusions, and make informed decisions based on evidence.
  • Promotes data literacy – As data becomes increasingly prevalent, students need to be able to interpret and communicate it effectively. Data literacy empowers students to make sense of information and use it to support arguments or solve real-world problems.
  • Prepares students for a data-driven world – Data has permeated every industry and aspect of our lives. From healthcare and finance to marketing and education, data plays a pivotal role in driving decisions, which makes working with and understanding data increasingly important.

Try it!
If you would like to try doing this analysis, you can download the Online Retail dataset here: https://archive.ics.uci.edu/dataset/352/online+retail. The code reference is linked below.
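
If you want a head start, preparing that dataset for mlxtend might look roughly like the sketch below. The file and column names match the UCI download at the time of writing, but verify them against your copy:

    # Requires: pip install pandas openpyxl mlxtend
    import pandas as pd
    from mlxtend.frequent_patterns import apriori, association_rules

    df = pd.read_excel("Online Retail.xlsx")
    df = df.dropna(subset=["InvoiceNo", "Description"])
    df["Description"] = df["Description"].str.strip()

    # One row per invoice, one boolean column per item description.
    basket = (
        df.groupby(["InvoiceNo", "Description"])["Quantity"].sum()
          .unstack(fill_value=0)
          .gt(0)
    )

    frequent = apriori(basket, min_support=0.02, use_colnames=True)
    rules = association_rules(frequent, metric="confidence", min_threshold=0.5)
    print(rules.sort_values("confidence", ascending=False).head())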

Code and Concept Reference:
https://www.datacamp.com/tutorial/market-basket-analysis-r

About the author:
Kunal Sonalkar is a data scientist at Nordstrom, the fashion retail company. He leverages machine learning techniques to improve the search retrieval experience and provide personalized product recommendations to online customers. He holds a master’s degree in computer science and engineering from the University of Florida.

Data Science Pipelines

A topic that comes up fairly regularly amongst data science professionals is the idea of pipelines. And I can imagine that all of the casual talk about pipes and pipelines probably makes it seem like data scientists are something more akin to plumbers than anything else… which wouldn’t be the worst characterization of the job I’ve ever heard.

Oftentimes, the goal of a data scientist is to build pipelines which might, for example:

  • Format raw data into datasets which can be quickly combined together and used for modeling purposes.
  • Build, train, and test a variety of models to identify which ones are most promising.
  • Deploy, monitor, and continually update models which are used for decision-making purposes.
  • Combine the three steps above into a seamless source of information that is readily available for decision-making (see the sketch just after this list).
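
To ground this a little, here is a minimal, hypothetical sketch of the modeling slice of such a pipeline using scikit-learn. Every file, column, and threshold below is a placeholder, not a prescription:

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    raw = pd.read_csv("claims.csv")  # hypothetical raw dataset
    X, y = raw.drop(columns=["outcome"]), raw["outcome"]

    # Format raw data into model-ready features (first bullet above).
    preprocess = ColumnTransformer([
        ("numeric",
         Pipeline([("impute", SimpleImputer()), ("scale", StandardScaler())]),
         ["age", "amount"]),                      # hypothetical numeric columns
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["region"]),
    ])

    # Build and train a candidate model on top of that formatting (second bullet).
    model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    model.fit(X_train, y_train)

    # Evaluate; in production this score would be monitored over time (third bullet).
    print("held-out accuracy:", model.score(X_test, y_test))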

What’s common throughout this (definitely non-exhaustive) list of pipelines is that data science is a field about building and evaluating processes. I think one obvious question this brings up is, “How do we prepare and train students or young/new data scientists to build these types of processes?” We prepare them by teaching them to be critical thinkers and consumers of data first.

Teaching critical data thinkers

Training students and/or new data scientists is a pipeline problem in and of itself. Ideally, we’d get students from diverse backgrounds and perspectives interested in the field, give them a taste of the work to spark their interest, and, once we’ve “hooked” them, motivate them to acquire specialized training at universities or through online coding programs. Why then is learning to be a critical thinker with data so important? Because it’s the thought process which underlies all successful data science pipelines.

People who are trained to think critically about data will spend more time thinking about how data has been sourced, who might be represented in such data, and who might not be. These considerations can then guide the assessment of new data sources, decisions about how to format them, and choices about how to represent that information for analysis.

Teaching students to be critical with data also teaches them how to represent and summarize information so that it can be interpreted honestly and easily, skills that are essential when it comes to evaluating competing models, monitoring model performance over time, or even just justifying business decisions to non-data experts.

The IDS to DS pipeline

One of the things I have always loved about the Introduction to Data Science (IDS) high school math curriculum is that critical thinking about data comes first. Students get exposure to data topics that are as relevant to data scientists today as they were when the curriculum was initially written. Then they experience, in an authentic and meaningful way, how data scientists apply these critical thinking skills by writing code.

Will students interested in a career in data science, at some point, need to learn lots of math, statistics, probability, and calculus? Without a doubt, just as students who want to become doctors need more than one high school biology class. So, is getting a PhD in statistics a necessary step before we can start getting people interested in data science? Absolutely not. In fact, I would argue that giving students a glimpse of what lies at the end of a mathematics pipeline guides more students into the field than trying to piece together an existing pipeline which is already leaking.

Data science is a great career which has already benefited immensely from data scientists coming into the field with diverse backgrounds such as economics, physics, computer science, mathematics, and more. My hope is that, with courses like IDS, we’ll continue to bring in new data scientists with different views and perspectives as we continue to grow the field into the future.

About the author:

Dr. James Molyneux is a data scientist for Swyfft, LLC, where he specializes in evaluating/developing new data sources and building risk/underwriting models and workflows. He is also courtesy faculty in the Department of Mathematics at Oregon State University.