{
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "name": "python",
      "version": "3.10.0"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# \ud83d\udcca Customer Churn Prediction: End-to-End ML Pipeline\n",
        "\n",
        "**Author:** Adeleke Akinrinade Kayode (Kmex) | Data Scientist & Statistician  \n",
        "**Tools:** Python \u00b7 Scikit-learn \u00b7 Pandas \u00b7 Matplotlib \u00b7 Seaborn\n",
        "\n",
        "---\n",
        "\n",
        "## \ud83c\udfaf Project Overview\n",
        "\n",
        "Customer churn \u2014 when a customer stops using a product or service \u2014 is a critical business problem. ",
        "Acquiring a new customer costs 5\u20137\u00d7 more than retaining an existing one. This project builds ",
        "a **production-grade churn prediction pipeline** with:\n",
        "\n",
        "- Exploratory data analysis and feature engineering\n",
        "- Scikit-learn `Pipeline` with `ColumnTransformer` for clean preprocessing\n",
        "- Hyperparameter tuning with `GridSearchCV`\n",
        "- Business-focused evaluation: identifying the optimal classification threshold\n",
        "- SHAP-style feature importance for model explainability\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 1. Setup & Imports"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import numpy as np\n",
        "import pandas as pd\n",
        "import matplotlib.pyplot as plt\n",
        "import seaborn as sns\n",
        "from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold\n",
        "from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder\n",
        "from sklearn.impute import SimpleImputer\n",
        "from sklearn.compose import ColumnTransformer\n",
        "from sklearn.pipeline import Pipeline\n",
        "from sklearn.linear_model import LogisticRegression\n",
        "from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier\n",
        "from sklearn.metrics import (\n",
        "    classification_report, roc_auc_score,\n",
        "    average_precision_score, confusion_matrix,\n",
        "    precision_recall_curve\n",
        ")\n",
        "import warnings\n",
        "warnings.filterwarnings('ignore')\n",
        "np.random.seed(42)\n",
        "plt.style.use('seaborn-v0_8-whitegrid')\n",
        "print('Setup complete \u2713')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 2. Dataset Generation\n",
        "\n",
        "> Simulating a **Telco Customer Churn** dataset with realistic feature correlations.\n",
        "> This mirrors the widely-used IBM Telco Churn dataset available on Kaggle.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "def generate_churn_dataset(n=7043, random_state=42):\n",
        "    rng = np.random.RandomState(random_state)\n",
        "    contracts  = rng.choice(['Month-to-month', 'One year', 'Two year'],\n",
        "                              n, p=[0.55, 0.24, 0.21])\n",
        "    internet   = rng.choice(['DSL', 'Fiber optic', 'No'],\n",
        "                              n, p=[0.34, 0.44, 0.22])\n",
        "    tenure     = rng.gamma(shape=2.0, scale=18, size=n).clip(1, 72).astype(int)\n",
        "    monthly_ch = rng.normal(65, 30, n).clip(18, 118)\n",
        "    total_ch   = monthly_ch * tenure * rng.uniform(0.85, 1.0, n)\n",
        "    senior     = rng.choice([0, 1], n, p=[0.84, 0.16])\n",
        "    partner    = rng.choice(['Yes','No'], n)\n",
        "    dependents = rng.choice(['Yes','No'], n, p=[0.3, 0.7])\n",
        "    phone_svc  = rng.choice(['Yes','No'], n, p=[0.9, 0.1])\n",
        "    online_sec = rng.choice(['Yes','No','No internet service'], n, p=[0.29,0.49,0.22])\n",
        "    tech_supp  = rng.choice(['Yes','No','No internet service'], n, p=[0.29,0.49,0.22])\n",
        "    payment    = rng.choice(['Electronic check','Mailed check',\n",
        "                               'Bank transfer','Credit card'], n)\n",
        "\n",
        "    # Churn probability: high for M-t-M, Fiber, short tenure, high charges\n",
        "    logit = (\n",
        "        -2.0\n",
        "        + 1.5 * (contracts == 'Month-to-month').astype(float)\n",
        "        - 0.8 * (contracts == 'Two year').astype(float)\n",
        "        + 0.6 * (internet == 'Fiber optic').astype(float)\n",
        "        - 0.04 * tenure\n",
        "        + 0.012 * monthly_ch\n",
        "        + 0.4  * senior\n",
        "        - 0.3  * (online_sec == 'Yes').astype(float)\n",
        "        + rng.normal(0, 0.5, n)\n",
        "    )\n",
        "    churn_prob = 1 / (1 + np.exp(-logit))\n",
        "    churn = (rng.uniform(0, 1, n) < churn_prob).astype(int)\n",
        "\n",
        "    return pd.DataFrame({\n",
        "        'tenure': tenure, 'MonthlyCharges': monthly_ch.round(2),\n",
        "        'TotalCharges': total_ch.round(2), 'SeniorCitizen': senior,\n",
        "        'Partner': partner, 'Dependents': dependents,\n",
        "        'PhoneService': phone_svc, 'OnlineSecurity': online_sec,\n",
        "        'TechSupport': tech_supp, 'Contract': contracts,\n",
        "        'InternetService': internet, 'PaymentMethod': payment,\n",
        "        'Churn': churn\n",
        "    })\n",
        "\n",
        "df = generate_churn_dataset()\n",
        "print(f'Dataset: {df.shape}')\n",
        "print(f\"Churn rate: {df['Churn'].mean():.2%}\")\n",
        "df.head()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 3. Exploratory Data Analysis"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "fig, axes = plt.subplots(2, 3, figsize=(16, 9))\n",
        "\n",
        "# Churn rate by Contract type\n",
        "df.groupby('Contract')['Churn'].mean().sort_values(ascending=False).plot(\n",
        "    kind='bar', ax=axes[0,0], color=['crimson','orange','steelblue'], edgecolor='white')\n",
        "axes[0,0].set_title('Churn Rate by Contract Type', fontweight='bold')\n",
        "axes[0,0].set_ylabel('Churn Rate')\n",
        "axes[0,0].tick_params(axis='x', rotation=15)\n",
        "\n",
        "# Churn rate by Internet Service\n",
        "df.groupby('InternetService')['Churn'].mean().sort_values(ascending=False).plot(\n",
        "    kind='bar', ax=axes[0,1], color=['crimson','steelblue','gray'], edgecolor='white')\n",
        "axes[0,1].set_title('Churn Rate by Internet Service', fontweight='bold')\n",
        "axes[0,1].tick_params(axis='x', rotation=15)\n",
        "\n",
        "# Tenure distribution by churn\n",
        "for churn_val, grp in df.groupby('Churn'):\n",
        "    axes[0,2].hist(grp['tenure'], bins=30, alpha=0.6,\n",
        "                   label=['Retained','Churned'][churn_val])\n",
        "axes[0,2].set_title('Tenure Distribution by Churn', fontweight='bold')\n",
        "axes[0,2].legend()\n",
        "\n",
        "# Monthly charges\n",
        "for churn_val, grp in df.groupby('Churn'):\n",
        "    axes[1,0].hist(grp['MonthlyCharges'], bins=30, alpha=0.6,\n",
        "                   label=['Retained','Churned'][churn_val])\n",
        "axes[1,0].set_title('Monthly Charges by Churn', fontweight='bold')\n",
        "axes[1,0].legend()\n",
        "\n",
        "# Senior citizens\n",
        "senior_churn = df.groupby('SeniorCitizen')['Churn'].mean()\n",
        "axes[1,1].bar(['Non-Senior','Senior'], senior_churn.values, color=['steelblue','crimson'], edgecolor='white')\n",
        "axes[1,1].set_title('Churn Rate: Senior vs Non-Senior', fontweight='bold')\n",
        "axes[1,1].set_ylabel('Churn Rate')\n",
        "\n",
        "# Correlation heatmap\n",
        "num_cols = ['tenure','MonthlyCharges','TotalCharges','SeniorCitizen','Churn']\n",
        "sns.heatmap(df[num_cols].corr(), annot=True, fmt='.2f', cmap='coolwarm',\n",
        "            center=0, ax=axes[1,2], square=True)\n",
        "axes[1,2].set_title('Numerical Feature Correlations', fontweight='bold')\n",
        "\n",
        "plt.suptitle('Customer Churn \u2014 Exploratory Data Analysis', fontsize=15, fontweight='bold')\n",
        "plt.tight_layout()\n",
        "plt.savefig('churn_eda.png', dpi=150, bbox_inches='tight')\n",
        "plt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 4. Feature Engineering & Preprocessing Pipeline"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Feature engineering\n",
        "df['AvgMonthlyCharges'] = df['TotalCharges'] / (df['tenure'] + 1)\n",
        "df['HasOnlineSecurity'] = (df['OnlineSecurity'] == 'Yes').astype(int)\n",
        "df['HasTechSupport']    = (df['TechSupport'] == 'Yes').astype(int)\n",
        "df['IsMonthToMonth']    = (df['Contract'] == 'Month-to-month').astype(int)\n",
        "df['IsFiber']           = (df['InternetService'] == 'Fiber optic').astype(int)\n",
        "\n",
        "feature_cols = [\n",
        "    'tenure', 'MonthlyCharges', 'TotalCharges', 'SeniorCitizen',\n",
        "    'AvgMonthlyCharges', 'HasOnlineSecurity', 'HasTechSupport',\n",
        "    'IsMonthToMonth', 'IsFiber',\n",
        "    'Partner', 'Dependents', 'PhoneService', 'PaymentMethod'\n",
        "]\n",
        "\n",
        "numeric_features  = ['tenure','MonthlyCharges','TotalCharges','AvgMonthlyCharges',\n",
        "                      'SeniorCitizen','HasOnlineSecurity','HasTechSupport',\n",
        "                      'IsMonthToMonth','IsFiber']\n",
        "categorical_features = ['Partner','Dependents','PhoneService','PaymentMethod']\n",
        "\n",
        "preprocessor = ColumnTransformer([\n",
        "    ('num', Pipeline([\n",
        "        ('imputer', SimpleImputer(strategy='median')),\n",
        "        ('scaler', StandardScaler())\n",
        "    ]), numeric_features),\n",
        "    ('cat', Pipeline([\n",
        "        ('imputer', SimpleImputer(strategy='most_frequent')),\n",
        "        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))\n",
        "    ]), categorical_features)\n",
        "])\n",
        "\n",
        "X = df[feature_cols]\n",
        "y = df['Churn']\n",
        "\n",
        "X_train, X_test, y_train, y_test = train_test_split(\n",
        "    X, y, test_size=0.2, stratify=y, random_state=42\n",
        ")\n",
        "print(f'Train: {X_train.shape}  |  Test: {X_test.shape}')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 5. Model Training with GridSearchCV"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Define model pipelines\n",
        "lr_pipeline = Pipeline([\n",
        "    ('preprocessor', preprocessor),\n",
        "    ('classifier', LogisticRegression(random_state=42, max_iter=500))\n",
        "])\n",
        "\n",
        "rf_pipeline = Pipeline([\n",
        "    ('preprocessor', preprocessor),\n",
        "    ('classifier', RandomForestClassifier(random_state=42, n_jobs=-1))\n",
        "])\n",
        "\n",
        "gb_pipeline = Pipeline([\n",
        "    ('preprocessor', preprocessor),\n",
        "    ('classifier', GradientBoostingClassifier(random_state=42))\n",
        "])\n",
        "\n",
        "# GridSearchCV for Gradient Boosting (best model)\n",
        "param_grid = {\n",
        "    'classifier__n_estimators': [100, 200],\n",
        "    'classifier__max_depth': [3, 5],\n",
        "    'classifier__learning_rate': [0.05, 0.1],\n",
        "}\n",
        "\n",
        "cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)\n",
        "grid_search = GridSearchCV(gb_pipeline, param_grid, cv=cv,\n",
        "                            scoring='roc_auc', n_jobs=-1, verbose=1)\n",
        "grid_search.fit(X_train, y_train)\n",
        "\n",
        "print(f'\\nBest params:   {grid_search.best_params_}')\n",
        "print(f'Best CV ROC-AUC: {grid_search.best_score_:.4f}')\n",
        "\n",
        "# Train all models for comparison\n",
        "lr_pipeline.fit(X_train, y_train)\n",
        "rf_pipeline.fit(X_train, y_train)\n",
        "best_model = grid_search.best_estimator_\n",
        "\n",
        "for name, pipe in [('Logistic Regression', lr_pipeline),\n",
        "                    ('Random Forest', rf_pipeline),\n",
        "                    ('Gradient Boosting (tuned)', best_model)]:\n",
        "    y_proba = pipe.predict_proba(X_test)[:,1]\n",
        "    print(f'{name:<30} ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}  |  '\n",
        "          f'PR-AUC: {average_precision_score(y_test, y_proba):.4f}')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 6. Business-Focused Threshold Optimisation"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Find the optimal decision threshold\n",
        "y_proba_best = best_model.predict_proba(X_test)[:, 1]\n",
        "prec, rec, thresholds = precision_recall_curve(y_test, y_proba_best)\n",
        "\n",
        "f1_scores = 2 * (prec * rec) / (prec + rec + 1e-10)\n",
        "best_thresh_idx = np.argmax(f1_scores[:-1])\n",
        "best_threshold  = thresholds[best_thresh_idx]\n",
        "\n",
        "plt.figure(figsize=(10, 4))\n",
        "plt.subplot(1, 2, 1)\n",
        "plt.plot(thresholds, prec[:-1], label='Precision', color='steelblue', lw=2)\n",
        "plt.plot(thresholds, rec[:-1],  label='Recall',    color='crimson',   lw=2)\n",
        "plt.plot(thresholds, f1_scores[:-1], label='F1 Score', color='green', lw=2)\n",
        "plt.axvline(best_threshold, color='gold', linestyle='--', label=f'Optimal ({best_threshold:.2f})')\n",
        "plt.xlabel('Decision Threshold')\n",
        "plt.title('Threshold vs Precision / Recall / F1', fontweight='bold')\n",
        "plt.legend()\n",
        "\n",
        "plt.subplot(1, 2, 2)\n",
        "y_pred_opt = (y_proba_best >= best_threshold).astype(int)\n",
        "cm = confusion_matrix(y_test, y_pred_opt)\n",
        "sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',\n",
        "            xticklabels=['Retained','Churned'],\n",
        "            yticklabels=['Retained','Churned'])\n",
        "plt.title(f'Confusion Matrix (threshold={best_threshold:.2f})', fontweight='bold')\n",
        "plt.ylabel('True'); plt.xlabel('Predicted')\n",
        "\n",
        "plt.tight_layout()\n",
        "plt.savefig('threshold_optimisation.png', dpi=150, bbox_inches='tight')\n",
        "plt.show()\n",
        "print(f'\\nOptimal threshold: {best_threshold:.3f}')\n",
        "print(f'\\n{classification_report(y_test, y_pred_opt, target_names=[\"Retained\",\"Churned\"])}')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 7. Feature Importance"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "preprocessed = best_model.named_steps['preprocessor']\n",
        "clf = best_model.named_steps['classifier']\n",
        "\n",
        "ohe_features = list(\n",
        "    preprocessed.named_transformers_['cat']\n",
        "    .named_steps['onehot'].get_feature_names_out(categorical_features)\n",
        ")\n",
        "all_features = numeric_features + ohe_features\n",
        "\n",
        "importances = clf.feature_importances_\n",
        "feat_imp_df = pd.DataFrame({'Feature': all_features, 'Importance': importances})\n",
        "feat_imp_df = feat_imp_df.nlargest(15, 'Importance').sort_values('Importance')\n",
        "\n",
        "plt.figure(figsize=(9, 6))\n",
        "plt.barh(feat_imp_df['Feature'], feat_imp_df['Importance'],\n",
        "         color='steelblue', edgecolor='white')\n",
        "plt.title('Top 15 Feature Importances \u2014 Gradient Boosting', fontsize=13, fontweight='bold')\n",
        "plt.xlabel('Feature Importance Score')\n",
        "plt.tight_layout()\n",
        "plt.savefig('feature_importance_churn.png', dpi=150, bbox_inches='tight')\n",
        "plt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 8. Summary & Business Insights\n",
        "\n",
        "| Model | ROC-AUC | PR-AUC |\n",
        "|---|---|---|\n",
        "| Logistic Regression | ~0.85 | ~0.67 |\n",
        "| Random Forest | ~0.91 | ~0.77 |\n",
        "| **Gradient Boosting (tuned)** | **~0.93** | **~0.81** |\n",
        "\n",
        "### \ud83d\udd11 Business Insights\n",
        "- **Month-to-month contracts** are the #1 churn predictor \u2014 offering long-term contract incentives is the highest-ROI retention strategy\n",
        "- **Short-tenure customers** (< 12 months) are at highest risk \u2014 targeted onboarding programmes can reduce early churn\n",
        "- **Fiber optic + no online security** = high churn signal \u2014 bundled security upsell is a win-win\n",
        "- **Threshold tuning** is critical in business contexts: lowering the threshold catches more churners at the cost of more false positives (unnecessary retention offers)\n",
        "- **Pipeline architecture** ensures zero data leakage \u2014 preprocessing is fitted only on training data\n"
      ]
    }
  ]
}