An End-to-end Project-based Approach to Teaching Data Mining Process

class: center, middle, inverse, title-slide

.title[
# An End-to-end Project-based Approach to Teaching Data Mining Process 
]
.subtitle[
## A Case Study in Credit Card Fraud Detection 
]
.author[
### Cheng Peng 
]
.institute[
### West Chester University of Pennsylvania 
]
.date[
### 05/14/2022. Presented at eCOTS 2022: Teaching Data Mining. Slides available at <a href="https://rpubs.com/cpeng/eCOTS2022" class="uri">https://rpubs.com/cpeng/eCOTS2022</a> and <a href="https://pengdsci.github.io/eCOTS2022/" class="uri">https://pengdsci.github.io/eCOTS2022/</a>
]

---

class: inverse, middle
## Agenda

### Learning from Learning Theories

- Learning Theories

- Pedagogical Strategies

### Case-study: Credit Card Fraud Mining

- Fraud Background

- Analytic View of Fraud and Challenges

- Feature Extraction

- Analytic Fraud Identification Methods and Assessment

- Deployment and Automation

---
class: inverse, center, middle

# Learning from Learning Theories

---
class: center, middle

# Teaching DM Process vs Techniques

Cross-Industry Standard Process for Data Mining (CRISP-DM)

<center>Source: https://blogs.sap.com/2018/08/28/sap-machine-learning-approaching-your-project/</center>

---
# Learning from Learning Theories

There are many learning theories, but most fall under three major theories.

- **Behaviorism Learning Theory**: knowledge is independent of, and external to, the learner. It focuses on how the outside environment influences learning.

- **Cognitive Learning Theory**: learners process the information they receive rather than just responding to a stimulus, as in behaviorism. It uses metacognition (“thinking about thinking”) to understand how thought processes influence learning.

- **Constructivism Learning Theory**: learners construct new ideas from prior knowledge and experience through active engagement with the world (such as experiments or real-world problem solving).

---

# Some Principles of Constructivism Theory

I am a firm believer in constructivism learning theory.

-	Knowledge is constructed. This is the basic principle, meaning that knowledge is built upon the foundation of previous learning.

-	Learning is a social activity. Learning is something we do together, in interaction with each other, rather than an abstract concept.

-	There is no knowledge independent of the meaning attributed to experience (constructed) by the learner, or community of learners.

-	Learning is contextual: we do not learn isolated facts and theories that are separated from the rest of our lives.

- Motivation is key to learning. Cognitive motivation is rooted in the availability of information and in past experience or prior knowledge.

---
# My Adopted Pedagogies in Teaching Analytics

-	Providing experience with the knowledge construction process - students determine how they will learn.

-	Providing experience in and appreciation for multiple perspectives - evaluation of alternative solutions.

-	Embedding learning in realistic contexts - authentic tasks.

- Embedding learning in social experience - collaborative learning.

- Encouraging awareness of the knowledge construction process - reflection, metacognition.

- Helping students make sense of the information currently available and determine how to respond or relate to the current situation.

---
class: inverse, center, middle

# Case Study

### Credit Card Fraud Detection

---

# Adapted CRISP-DM for Fraud Mining Process

- **Data Preparation** - plan for automation in this phase.

- **Modeling** - train / retrain models and algorithms as the fraud dynamics change.

---

# Credit Card Transaction Process

<center>Source: https://business.chase.com/resources/start/a-crash-course-on-taking-the-mystery-out-of-payments </center>
---

# What is Credit Card Fraud?

**Credit card fraud** is a form of identity theft that involves an unauthorized taking of another’s credit card information for the purpose of charging purchases to the account or removing funds from it.

**Credit Card Fraud Types**: Credit card fraud schemes generally fall into one of two categories of fraud: application fraud and account takeover.

- Identity theft

- [Skimming Fraud (a kind of account takeover)](https://www.youtube.com/watch?v=G_aH50Tn8Fo)

**Why Combat Credit Fraud Loss**:  Card fraud over the next decade will cost the industry a collective $408.50 billion in losses globally, according to an annual report from the industry research firm Nilson Report.

---

# Fraud Data Generation Process & Availability

Pre-authorization: timestamp, geo-info of POS, Card information (card number, expiration date, billing address, security code)

Authorization: Pre-auth info + requested payment amount

Authentication: the issuing bank will

- verify the authorization information sent from the processor: validating card info and checking the availability of funds (credit line); and

- send the result of the authentication to the merchant: approval or denial.

The merchant then sends the complete transaction information to the issuing bank or the processor.

---

# Fraud Data Generation Process: A General Fraud Management System

---
#	Availability and Types of Data

Based on credit card processing and the general fraud detection system, the following information is available at different processing stages:

- **Pre-authorization Data**
  + geo-information of point of sale (POS).
  + timestamp.
  + card information.

- **Authorization and Authentication Data**
  + pre-auth information.
  + payment information.

- **Historical Data**
  + complete transaction information.
  + confirmed fraud (labels).
  + account information, etc.

- **Other Publicly Available Data**: crime rate, etc.

---

# Data Preparation - Collection

**Goal**: detect/identify fraudulent transactions.
 
**Challenges**:

- No information about fraudsters!

- Real-time detection.

- Rarity of fraud.

**What information is relevant?**

- Current transaction: card info, timestamp, amount, POS info.

- Historical transactions: timestamp, amount, POS info, fraud labels.

- Account information: cardholder’s info.

- Derived merchant site info (including publicly available info).

---
#	Creating Analytic Data According to Potential Analytic Methods

-	**Key Point**: Fraudulent activity alters genuine customers’ spending patterns!

-	**Cross-sectional Data**: current transactions.

- **Longitudinal/Panel Data**: current and historical transactions.

- **Hybrid Cross-sectional and Longitudinal Data**: both current transactions and aggregated information from historical transactions.
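
A minimal sketch of the hybrid layout, assuming each transaction is a dict with the hypothetical keys `card`, `time`, and `amount` (these names and the chosen aggregates are illustrative, not taken from the data used in this talk). The current transaction is kept as-is, and the card's history is collapsed into a few summary values.

```python
from statistics import mean

def hybrid_record(current, history):
    """Combine one current transaction with aggregates of the same card's past.

    The keys 'card', 'time', 'amount' and the aggregate names are assumptions.
    """
    past = [t["amount"] for t in history
            if t["card"] == current["card"] and t["time"] < current["time"]]
    return {
        "card": current["card"],
        "amount": current["amount"],            # cross-sectional part
        "hist_n": len(past),                    # aggregated longitudinal part
        "hist_mean": mean(past) if past else None,
        "hist_max": max(past) if past else None,
    }

# Made-up transactions for two cards.
history = [
    {"card": "A", "time": 1, "amount": 20.0},
    {"card": "A", "time": 2, "amount": 30.0},
    {"card": "B", "time": 2, "amount": 500.0},
]
current = {"card": "A", "time": 3, "amount": 400.0}
record = hybrid_record(current, history)
print(record)
```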

---
#	Types of Candidate Models/Algorithms

-	Business rules (expert system).

- Supervised classification models/algorithms
  + handling the rarity of fraudulent transactions.
  + using fraud labels to train.
  + fraud index as a predictor variable.
  + rare event logistic models.
  + penalized tree-based classification models/algorithms.

- Unsupervised anomaly detection methods
  + using the distribution of the fraud index to detect fraud: high quantile along with operational constraints.
  + no need to handle class imbalance (fraud labels are not used).

- Other probabilistic models/algorithms such as hidden Markov models (HMM).

---
#	Fraud Index Based on Historical Transactions

How fraudulent activities alter genuine customers’ spending patterns.

<center>Modified from: https://neo4j.com/blog/fraud-prevention-neo4j-5-minute-overview/</center>

-	The transaction dollar amount is significantly different from that of genuine customers.

- Genuine customers’ spending frequency changes.

- Genuine customers’ transaction gap times (time between consecutive transactions) change.
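
As a rough illustration of the last two signals, the sketch below computes gap times and a spending frequency from a list of timestamps. The function names and the timestamps (in hours) are made up for this example.

```python
def gap_times(timestamps):
    """Time between consecutive transactions, after sorting the timestamps."""
    ts = sorted(timestamps)
    return [b - a for a, b in zip(ts, ts[1:])]

def frequency(timestamps, window_start, window_end):
    """Transactions per unit time in the half-open window [start, end)."""
    n = sum(window_start <= t < window_end for t in timestamps)
    return n / (window_end - window_start)

# Hypothetical timestamps in hours: a quiet history followed by a burst.
hours = [0, 24, 49, 72, 73, 73.5, 74]
gaps = gap_times(hours)
rate_before = frequency(hours, 0, 72)
rate_burst = frequency(hours, 72, 75)
print(gaps)
print(rate_before, rate_burst)
```

In this toy series the gaps shrink from about a day to an hour or less, and the transaction rate rises accordingly.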

---
# What is Process Capability Index (PCI)?

Process capability compares the output of an in-control process to the specification limits by using capability indices.

-	If the PCI of a process is under a threshold, the process is incapable.
-	There are different PCIs for different processes.
- USL and LSL (the upper and lower specification limits) need to be estimated from a portion of the data.
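
For reference, two widely used capability indices are `Cp = (USL - LSL) / (6 * sigma)` and `Cpk = min(USL - mu, mu - LSL) / (3 * sigma)`. The sketch below computes both; the data and the specification limits are made up for illustration.

```python
from statistics import mean, stdev

def cp(data, lsl, usl):
    """Potential capability: specification width relative to process spread."""
    return (usl - lsl) / (6 * stdev(data))

def cpk(data, lsl, usl):
    """Actual capability: also penalizes a process mean that is off-center."""
    mu, sigma = mean(data), stdev(data)
    return min(usl - mu, mu - lsl) / (3 * sigma)

amounts = [48.0, 50.0, 52.0, 49.0, 51.0]        # made-up measurements
cp_centered = cp(amounts, lsl=40.0, usl=60.0)
cpk_centered = cpk(amounts, lsl=40.0, usl=60.0)
cpk_shifted = cpk(amounts, lsl=45.0, usl=60.0)  # mean now closer to LSL
print(cp_centered, cpk_centered, cpk_shifted)
```

When the process mean sits midway between the limits, the two indices coincide; moving a limit closer to the mean lowers `Cpk`.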

---
class: inverse, center, middle

# A Numerical Example

### Data Layout, Candidate Models, and Algorithms

---

#	The “Capability” of Customers’ Spending Process – Fraud Index

Illustration: Defining a fraud index using historical payment dollar amounts.

---

# Pre-processed Data (Long Table)

---

# Data Matrix

---
class: inverse, center, middle

# A PCI-like Fraud Index Using Payment Amount

## 
`$$idx=\frac{(USL-\mu)^2}{9(\mbox{max} - \mu)^2+(T-\mu)^2}$$`

### USL, T: estimated from the larger data set.

### max, `$\mu$`: estimated from the smaller data set.

### The sample sizes of both data sets are tuning parameters.
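
The sketch below evaluates the index formula on made-up payment amounts. How USL and T are estimated is not specified here, so the choices in the code (USL as an empirical percentile of the larger sample, T as its mean) are assumptions made only for this example, as are the function and argument names.

```python
from statistics import mean, quantiles

def fraud_index(larger, smaller, usl_pct=95):
    """PCI-like index: (USL - mu)^2 / (9 (max - mu)^2 + (T - mu)^2).

    Assumptions (illustrative only): USL is the usl_pct-th percentile of the
    larger sample and T is its mean; max and mu come from the smaller sample.
    """
    cuts = quantiles(larger, n=100, method="inclusive")  # 99 percentile cut points
    usl = cuts[usl_pct - 1]
    target = mean(larger)
    mu, mx = mean(smaller), max(smaller)
    denom = 9 * (mx - mu) ** 2 + (target - mu) ** 2
    return (usl - mu) ** 2 / denom if denom > 0 else float("inf")

# Made-up payment amounts: a longer window and two candidate shorter windows.
larger = [20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 55.0, 60.0, 65.0]
idx_normal = fraud_index(larger, [30.0, 40.0, 35.0])
idx_spike = fraud_index(larger, [30.0, 40.0, 400.0])
print(idx_normal, idx_spike)
```

With these numbers, the window containing the 400 payment gets a much smaller index than the ordinary window. The sizes of the two samples and the percentile are the tuning knobs.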

---

# How Fraud Index Works in Fraud Detection

---

# Distribution of Resulting Fraud Index

- The above figure indicates that the fraud index can be used as a standalone fraud detection algorithm with no structural parameters, that is, as an unsupervised anomaly detection method.
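
A minimal sketch of a quantile-based decision rule, assuming the scores are oriented so that larger means more suspicious (an index that shrinks under fraud would be inverted or negated first). The percentile and the `max_alerts` cap, standing in for an operational constraint such as review capacity, are hypothetical.

```python
from statistics import quantiles

def flag_by_quantile(scores, pct=99, max_alerts=None):
    """Return positions of scores above the pct-th percentile, most extreme first.

    max_alerts is a hypothetical operational constraint, e.g. the number of
    cases an investigation team can review.
    """
    cutoff = quantiles(scores, n=100, method="inclusive")[pct - 1]
    flagged = sorted((i for i, s in enumerate(scores) if s > cutoff),
                     key=lambda i: scores[i], reverse=True)
    return flagged if max_alerts is None else flagged[:max_alerts]

# Made-up scores with two obvious outliers.
scores = [0.1, 0.2, 0.15, 0.3, 5.0, 0.25, 0.12, 7.5, 0.18, 0.22]
all_flags = flag_by_quantile(scores, pct=80)
capped_flags = flag_by_quantile(scores, pct=80, max_alerts=1)
print(all_flags, capped_flags)
```

No fraud labels are used anywhere in this rule; only the empirical distribution of the scores determines the cutoff.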

---

# Performance Analysis

- Consider a multivariate fraud index that incorporates gap time and spending frequency to boost the discriminatory power of the index.

---
# Supervised Algorithms and Models

The fraud index will be used as a feature variable.

Models and algorithms need to account for imbalanced labels.

- Firth penalized logit models.

- King and Zeng's rare event logistic model.

- Qing's semi-parametric logistic model.

- penalized tree-based algorithms (including bagging; random forest is not an option for this particular case).
 
- regular logit models based on over-/under-sampled data.

- asymmetric-link GLMs.
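
As one concrete instance of the resampling option, the sketch below randomly under-samples the non-fraud class before a regular model is fitted. The function name, the `ratio` argument, and the 2% fraud rate are illustrative assumptions.

```python
import random

def undersample(rows, labels, ratio=1.0, seed=0):
    """Keep all positives (label 1) and a random subset of negatives (label 0).

    ratio is the desired number of negatives per positive.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg), int(ratio * len(pos)))
    keep = sorted(pos + rng.sample(neg, n_keep))
    return [rows[i] for i in keep], [labels[i] for i in keep]

labels = [0] * 98 + [1] * 2      # 2% fraud, a made-up rate
rows = list(range(100))          # stand-ins for feature vectors
x_bal, y_bal = undersample(rows, labels, ratio=3.0)
print(len(y_bal), sum(y_bal))
```

A model fitted to resampled data sees a higher fraud rate than the population, so its predicted probabilities generally need to be corrected or recalibrated before use.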

---
class: inverse, center, middle

# Deployment / Monitoring and Updating

---
# Deployment Workflow & Improvement

- Deploying algorithms and models is only a component of the DM process.

- The champion/challenger scheme in real-world DM systems.

- Continuously updating models/algorithms: retraining and retesting.

- Importance of automation in the DM process.
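
A toy sketch of a champion/challenger comparison, assuming both models have scored the same labeled holdout set. The metric (share of frauds captured in the top `k` alerts), the promotion margin, and all names are assumptions made for illustration, not a description of any production system.

```python
def recall_at_k(scores, labels, k):
    """Share of all frauds captured among the k highest-scored transactions."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(labels)
    return sum(labels[i] for i in top) / total if total else 0.0

def pick_champion(champion, challenger, labels, k, margin=0.0):
    """Promote the challenger only if it beats the champion by more than margin."""
    champ = recall_at_k(champion, labels, k)
    chall = recall_at_k(challenger, labels, k)
    return "challenger" if chall > champ + margin else "champion"

# Made-up holdout labels and model scores.
labels = [0, 0, 1, 0, 1, 0, 0, 0]
champion_scores = [0.2, 0.9, 0.8, 0.1, 0.3, 0.4, 0.2, 0.1]
challenger_scores = [0.1, 0.3, 0.9, 0.2, 0.8, 0.1, 0.2, 0.1]
winner = pick_champion(champion_scores, challenger_scores, labels, k=2)
print(winner)
```

Running such a comparison on a schedule, with fresh labels, is one simple way to automate the retraining and retesting loop.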

---
class: inverse, center, middle

# Learning by Doing!

---
# Student Project Ideas

There are many moving parts in the definition of the fraud index and in the ways of using it. Even with the same data, students can build their projects from combinations of the following:

- Methods of estimating USL and LSL

- One-sided fraud indexes?

- Parametric and non-parametric indexes?

- Supervised methods using both labels and index
  + statistical models
  + machine learning algorithms

- Index as a standalone algorithm - high quantile decision boundary
  + parametric distribution of the indexes
  + non-parametric distribution of the indexes

---
class: inverse, center, middle

# Thanks!

Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan).