Machine Learning | 2024 | github.com/kmexa

Credit Card Fraud Detection: End-to-End ML Pipeline

End-to-end fraud detection pipeline mirroring real-world AML system challenges: extreme class imbalance (~1.7% fraud rate), anonymised PCA features, and the need to minimise false negatives without overwhelming analysts with false positives.


Methodology

01 Exploratory data analysis: distributional differences between fraud and legitimate transactions in PCA-feature space, class imbalance visualisation
02 Preprocessing: StandardScaler on Amount and Time, 80/20 stratified train-test split to preserve fraud class ratio
03 SMOTE oversampling applied to training set only, preventing data leakage into the test fold
04 Three-model benchmark: Logistic Regression (baseline), Random Forest, XGBoost under identical conditions
05 XGBoost selected for production; feature importance reveals V1-V4 and Amount as strongest fraud predictors
06 PR-AUC used as the primary evaluation metric because ROC-AUC is misleadingly high for severely imbalanced datasets

Results

ROC-AUC: 0.983
PR-AUC: 0.871
Fraud Class F1: 0.87

Model                 ROC-AUC   PR-AUC
Logistic Regression   0.921     0.647
Random Forest         0.971     0.823
XGBoost (selected)    0.983     0.871
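
For reference, the three metrics reported above can be computed with scikit-learn along these lines. This is a hedged sketch on synthetic data with a Logistic Regression stand-in, so its numbers will not match the table. Using average_precision_score as the PR-AUC estimate and a 0.5 decision threshold for F1 are assumptions; the write-up does not state which were used.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption): roughly 2% positives.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.98, 0.02], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]  # predicted probability of the fraud class

roc_auc = roc_auc_score(y_test, scores)
pr_auc = average_precision_score(y_test, scores)  # one common PR-AUC estimate
fraud_f1 = f1_score(y_test, (scores >= 0.5).astype(int), pos_label=1, zero_division=0)

# A random ranker scores about 0.5 on ROC-AUC regardless of imbalance, but only
# about the positive-class prevalence on PR-AUC, which is why PR-AUC is the
# stricter yardstick when fraud is rare.
pr_baseline = y_test.mean()

print(f"ROC-AUC {roc_auc:.3f} | PR-AUC {pr_auc:.3f} | fraud F1 {fraud_f1:.3f}")
print(f"PR-AUC of a random ranker (prevalence): {pr_baseline:.3f}")
```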

XGBoost, Random Forest, SMOTE, scikit-learn, imbalanced-learn, pandas, Python
