Machine Learning | 2024 | github.com/kmexa

Credit Card Fraud Detection: End-to-End ML Pipeline

End-to-end fraud detection pipeline mirroring real-world AML system challenges: extreme class imbalance (~1.7% fraud rate), anonymised PCA features, and the need to minimise false negatives without overwhelming analysts with false positives.


Methodology

01 Exploratory data analysis: distributional differences between fraud and legitimate transactions in PCA-feature space, class imbalance visualisation

02 Preprocessing: StandardScaler on Amount and Time, 80/20 stratified train-test split to preserve fraud class ratio

03 SMOTE oversampling applied to training set only, preventing data leakage into the test fold

04 Three-model benchmark: Logistic Regression (baseline), Random Forest, XGBoost under identical conditions

05 XGBoost selected for production; feature importance reveals V1-V4 and Amount as strongest fraud predictors

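06 PR-AUC used as the primary evaluation metric: on severely imbalanced data ROC-AUC can look misleadingly high, because the large pool of legitimate transactions keeps the false-positive rate low even when most alerts are wrong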

Results

ROC-AUC: 0.983

PR-AUC: 0.871

Fraud Class F1: 0.87

Model	ROC-AUC	PR-AUC
Logistic Regression	0.921	0.647
Random Forest	0.971	0.823
XGBoost (selected)	0.983	0.871
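The gap between the two columns is typical of imbalanced data. The toy example below is a hand-made illustration, not taken from the notebook: 100 transactions, one of them fraudulent, with a single legitimate transaction scored above it. The helper functions are plain-Python versions of the two metrics, with average precision used as the PR-AUC estimate.

```python
def roc_auc(labels, scores):
    """Share of (fraud, legitimate) pairs in which the fraud case scores higher."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each fraud case."""
    ranked = sorted(zip(scores, labels), reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# One legitimate transaction (score 0.9) outranks the only fraud case (0.8);
# the remaining 98 legitimate transactions all score below 0.1.
labels = [0, 1] + [0] * 98
scores = [0.9, 0.8] + [i / 1000 for i in range(98)]

print(round(roc_auc(labels, scores), 3))   # 0.99
print(average_precision(labels, scores))   # 0.5
```

A single false alarm ranked above the fraud case barely moves ROC-AUC but halves average precision, which is why the PR-AUC column separates the three models more clearly.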

XGBoost, Random Forest, SMOTE, scikit-learn, imbalanced-learn, pandas, Python
