Intermediate · 3 sessions · 6 hours · Python

Week 9: Supervised Machine Learning

Topics covered: Decision trees (CART, Gini impurity), Random Forest (bagging, feature importance), cross-validation (k-fold stratified), hyperparameter tuning (GridSearchCV), scikit-learn Pipelines, SHAP values

Learning objectives: By the end of this week you will be able to apply supervised machine learning concepts to real datasets, write executable Python code for each technique, and complete both graded assignments independently.

Session 1: Decision Trees and Random Forests

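CART builds a tree by recursively splitting the data on the feature and threshold that minimise the weighted Gini impurity of the resulting child nodes, where Gini = 1 - sum(p_i^2) and p_i is the fraction of samples in the node that belong to class i. A pure node has Gini = 0. Random Forest trains an ensemble of trees, each on a bootstrap sample of the training data and each considering only a random subset of about sqrt(p) of the p features at every split. This double randomisation produces diverse, weakly correlated trees, so combining them reduces variance. The final prediction is a majority vote (classification) or the mean (regression). Feature importance is the mean decrease in impurity attributed to each feature, averaged across all trees.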
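To make the Gini formula concrete, here is a minimal sketch that scores two candidate splits by hand. The helper names `gini` and `weighted_split_gini` and the toy labels are made up for illustration; they are not part of scikit-learn.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_i^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_split_gini(left, right):
    """Impurity of a candidate split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# Hypothetical node with 4 samples of each class
parent = [0, 0, 0, 0, 1, 1, 1, 1]
split_a = ([0, 0, 0, 0], [1, 1, 1, 1])   # separates the classes perfectly
split_b = ([0, 0, 1, 1], [0, 0, 1, 1])   # separates nothing

print(gini(parent))                     # 0.5
print(weighted_split_gini(*split_a))    # 0.0
print(weighted_split_gini(*split_b))    # 0.5
```

CART would prefer split A, because it lowers the weighted impurity from 0.5 to 0, while split B leaves it unchanged.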

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.datasets import load_breast_cancer
import pandas as pd

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

rf = RandomForestClassifier(n_estimators=200, max_depth=8, min_samples_leaf=5,
                            random_state=42, n_jobs=-1)
rf.fit(X_tr, y_tr)

y_pred  = rf.predict(X_te)
y_proba = rf.predict_proba(X_te)[:,1]
print(f'AUC-ROC: {roc_auc_score(y_te, y_proba):.4f}')
print(classification_report(y_te, y_pred))

fi = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print('Top 5 features:')
print(fi.head(5))
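The two sources of randomness described at the start of this session can also be sketched by hand, using plain decision trees. This is an illustrative approximation rather than how `RandomForestClassifier` is implemented: the tree count of 25 is an arbitrary choice, and scikit-learn's forest averages the trees' predicted probabilities instead of counting hard votes.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

rng = np.random.default_rng(42)
n_trees = 25  # arbitrary, odd so that a 0/1 vote cannot tie
trees = []
for i in range(n_trees):
    # Randomisation 1: bootstrap sample (rows drawn with replacement)
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # Randomisation 2: each split considers only sqrt(p) randomly chosen features
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=i)
    trees.append(tree.fit(X_tr[idx], y_tr[idx]))

# Majority vote: labels are 0/1, so a mean vote above 0.5 means class 1 wins
votes = np.stack([t.predict(X_te) for t in trees])   # shape (n_trees, n_test)
y_vote = (votes.mean(axis=0) > 0.5).astype(int)

single_acc = (trees[0].predict(X_te) == y_te).mean()
ensemble_acc = (y_vote == y_te).mean()
print(f'Single tree accuracy:    {single_acc:.3f}')
print(f'Voted ensemble accuracy: {ensemble_acc:.3f}')
```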

Session 2: Cross-Validation and Hyperparameter Tuning

A single train-test split gives a high-variance performance estimate, because the score depends on which samples happen to land in the test set. K-fold CV splits the data into k folds, trains on k-1 and tests on the remaining one, repeats this k times so that each fold serves as the test set once, then averages the scores. Use stratified k-fold for imbalanced classification so that every fold preserves the class proportions. GridSearchCV exhaustively tries every hyperparameter combination and trains k models per combination. scikit-learn Pipelines chain preprocessing and modelling, which prevents data leakage: the scaler is fit on the training portion of each fold only.
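As a small illustration of the fold-to-fold spread, the sketch below prints the per-fold scores instead of only their average. Logistic regression is swapped in here as the model (an assumption, chosen because it is sensitive to feature scaling, so the leakage point about the scaler actually matters).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The scaler lives inside the pipeline, so it is re-fit on the
# training portion of every fold and never sees that fold's test data.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring='roc_auc')

print('Per-fold AUC-ROC:', scores.round(4))
print(f'Mean +/- std:     {scores.mean():.4f} +/- {scores.std():.4f}')
```

Reporting the standard deviation alongside the mean gives a rough sense of how much a single split could have misled you.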

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('rf', RandomForestClassifier(random_state=42))
])

param_grid = {
    'rf__n_estimators': [100, 200],
    'rf__max_depth':    [5, 8, None],
    'rf__min_samples_leaf': [3, 5]
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(pipeline, param_grid, cv=cv, scoring='roc_auc', n_jobs=-1)
grid.fit(X, y)

print(f'Best AUC-ROC: {grid.best_score_:.4f}')
print(f'Best params:  {grid.best_params_}')
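It helps to know how many models a grid search will train before launching it. A quick sketch of the arithmetic for the grid above, using `ParameterGrid` to enumerate the combinations; the extra fit assumes GridSearchCV's default `refit=True`.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'rf__n_estimators': [100, 200],
    'rf__max_depth': [5, 8, None],
    'rf__min_samples_leaf': [3, 5],
}
n_splits = 5

n_combos = len(ParameterGrid(param_grid))   # 2 * 3 * 2 = 12
n_cv_fits = n_combos * n_splits             # 12 * 5 = 60
n_total_fits = n_cv_fits + 1                # plus one final refit on all data

print(f'{n_combos} combinations, {n_cv_fits} CV fits, {n_total_fits} fits in total')
```

Note that `best_score_` is the mean cross-validated score of the winning combination. Because that combination was selected on the same folds, the score tends to be slightly optimistic; a separate held-out test set gives a cleaner final estimate.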

Session 3: SHAP Values for Model Explainability

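SHAP (SHapley Additive exPlanations) is grounded in cooperative game theory. For each prediction, the SHAP value for feature i measures how much feature i contributed to the difference between the model's prediction and the base value, which is the model's mean prediction over the background data. A positive SHAP value pushes the prediction higher and a negative one pushes it lower. The sum of all SHAP values equals the prediction minus the base value. Explanations of this kind are often used to support transparency obligations for automated decisions, such as those in GDPR Article 22 and the EU AI Act, although neither regulation mandates SHAP specifically.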
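The additivity property can be checked by brute force on a tiny example. The sketch below computes exact Shapley values for a made-up three-feature model by enumerating every feature subset. The `model` function, the random background data, and the instance `x` are all hypothetical, and this exhaustive approach only works for a handful of features; it is not how the `shap` library computes values for tree models.

```python
from itertools import combinations
from math import factorial
import numpy as np

def model(X):
    """Hypothetical model with an interaction term; stands in for any predictor."""
    return 2.0 * X[:, 0] + X[:, 1] * X[:, 2]

rng = np.random.default_rng(0)
background = rng.normal(size=(200, 3))   # made-up background data
x = np.array([1.0, 2.0, -1.0])           # the instance to explain
p = x.size

def value(subset):
    """Mean prediction when the features in `subset` are fixed to x's values
    and the remaining features keep their background values."""
    Z = background.copy()
    cols = list(subset)
    if cols:
        Z[:, cols] = x[cols]
    return model(Z).mean()

phi = np.zeros(p)
for i in range(p):
    others = [j for j in range(p) if j != i]
    for size in range(p):
        # Shapley weight for coalitions of this size
        weight = factorial(size) * factorial(p - size - 1) / factorial(p)
        for S in combinations(others, size):
            # Marginal contribution of feature i when added to coalition S
            phi[i] += weight * (value(S + (i,)) - value(S))

base_value = value(())                   # mean prediction over the background
prediction = model(x[None, :])[0]

print('SHAP values:      ', phi.round(4))
print('Base value + sum: ', round(base_value + phi.sum(), 4))
print('Model prediction: ', round(prediction, 4))
```

The last two printed numbers agree: the SHAP values account for the whole gap between the base value and this instance's prediction.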

import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
import pandas as pd

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

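explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X)

# Select the SHAP values for the positive class (class 1).
# Depending on the shap version, shap_values is either a list
# [class_0, class_1] or an array of shape (n_samples, n_features, n_classes).
if isinstance(shap_values, list):
    sv_pos = shap_values[1]
else:
    sv_pos = shap_values[:, :, 1]

# Beeswarm plot: feature importance + direction of effect
# (summary_plot draws the beeswarm with its default plot_type='dot')
shap.summary_plot(sv_pos, X)

# Explain one specific prediction (the first row of X)
shap.waterfall_plot(shap.Explanation(
    values=sv_pos[0],
    base_values=explainer.expected_value[1],
    data=X.iloc[0].values,
    feature_names=X.columns.tolist()
))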

Week 9 Assignments

Submit completed notebooks to your GitHub repository before the next session. Feedback within 48 hours.

Assignment 1: Train Logistic Regression, Random Forest, and XGBoost models using Pipeline + StandardScaler. Compare them with 5-fold stratified CV. Tune the best model with GridSearchCV. Generate a SHAP beeswarm plot. Write a 300-word model selection justification.

Assignment 2: Fraud detection on a Kaggle credit card dataset. Address the class imbalance with SMOTE, evaluate with a precision-recall curve (and explain why not ROC), and achieve at least 90% recall on the fraud class.
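As a starting point for thinking about the recall target, the sketch below picks a decision threshold from the precision-recall curve. Everything here is a stand-in: the data is synthetic with roughly 2% positives rather than the Kaggle dataset, and `class_weight='balanced'` is used in place of SMOTE to avoid the extra imbalanced-learn dependency.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score,
                             precision_recall_curve, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced data (an assumption, not the assignment dataset)
X, y = make_classification(n_samples=20000, n_features=20, n_informative=5,
                           weights=[0.98, 0.02], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000, class_weight='balanced').fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

print(f'AUC-ROC:           {roc_auc_score(y_te, proba):.3f}')
print(f'Average precision: {average_precision_score(y_te, proba):.3f}')

precision, recall, thresholds = precision_recall_curve(y_te, proba)
# precision and recall have one more entry than thresholds; drop the last point
meets_target = recall[:-1] >= 0.90
# Among thresholds that reach 90% recall, keep the one with the best precision
best = np.argmax(np.where(meets_target, precision[:-1], -1.0))

print(f'threshold={thresholds[best]:.3f}  '
      f'precision={precision[best]:.3f}  recall={recall[best]:.3f}')
```

In real work the threshold should be chosen on a validation fold rather than on the test set, otherwise the reported precision and recall are optimistic.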
