{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "f5a28603",
   "metadata": {},
   "source": [
    "# News Article Text Classification using NLP\n",
    "\n",
    "**Author:** Adeleke Akinrinade Kayode (kmexa)  \n",
    "**Background:** This project applies NLP techniques to multi-class text classification.\n",
    "The pipeline structure mirrors workflows used in policy document analysis and\n",
    "automated report categorisation — both relevant to consulting and research contexts.\n",
    "\n",
    "## Sections\n",
    "1. Imports and Setup\n",
    "2. Data Loading and Exploration\n",
    "3. Text Preprocessing\n",
    "4. TF-IDF Vectorisation\n",
    "5. Model Training\n",
    "6. Evaluation\n",
    "7. Keyword Analysis\n",
    "8. Summary\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ce372356",
   "metadata": {},
   "source": [
    "## 1. Imports and Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "acad23b8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Imports for text processing, feature extraction, modelling, and evaluation\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import re\n",
    "import string\n",
    "import warnings\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "from sklearn.datasets import fetch_20newsgroups\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS\n",
    "from sklearn.naive_bayes import MultinomialNB\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.svm import LinearSVC\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.metrics import (\n",
    "    classification_report, confusion_matrix,\n",
    "    accuracy_score, f1_score\n",
    ")\n",
    "\n",
    "warnings.filterwarnings('ignore')\n",
    "sns.set_style('whitegrid')\n",
    "np.random.seed(42)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fb81df47",
   "metadata": {},
   "source": [
    "## 2. Data Loading and Exploration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "157ec517",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load six topically distinct categories from the 20 Newsgroups dataset.\n",
    "# Removing headers, footers, and quoted replies prevents the model from\n",
    "# learning metadata patterns instead of actual content.\n",
    "\n",
    "categories = [\n",
    "    'rec.sport.hockey',\n",
    "    'sci.med',\n",
    "    'sci.space',\n",
    "    'talk.politics.guns',\n",
    "    'comp.graphics',\n",
    "    'rec.autos'\n",
    "]\n",
    "\n",
    "label_map = {\n",
    "    'rec.sport.hockey': 'Sports - Hockey',\n",
    "    'sci.med': 'Science - Medicine',\n",
    "    'sci.space': 'Science - Space',\n",
    "    'talk.politics.guns': 'Politics - Guns',\n",
    "    'comp.graphics': 'Computing - Graphics',\n",
    "    'rec.autos': 'Automotive'\n",
    "}\n",
    "\n",
    "data = fetch_20newsgroups(\n",
    "    subset='all',\n",
    "    categories=categories,\n",
    "    remove=('headers', 'footers', 'quotes'),\n",
    "    random_state=42\n",
    ")\n",
    "\n",
    "df = pd.DataFrame({'text': data.data, 'label': data.target})\n",
    "df['category'] = df['label'].map({i: label_map[c] for i, c in enumerate(categories)})\n",
    "df = df[df['text'].str.strip().str.len() > 50].reset_index(drop=True)\n",
    "\n",
    "print(f\"Total documents: {len(df):,}\")\n",
    "print()\n",
    "print(df['category'].value_counts())\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0e8a3ae2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Document count per category\n",
    "plt.figure(figsize=(10, 5))\n",
    "counts = df['category'].value_counts()\n",
    "plt.bar(counts.index, counts.values,\n",
    "        color=sns.color_palette('muted', len(counts)),\n",
    "        edgecolor='black')\n",
    "plt.title('Document Count by Category')\n",
    "plt.ylabel('Number of Documents')\n",
    "plt.xticks(rotation=25, ha='right')\n",
    "for i, v in enumerate(counts.values):\n",
    "    plt.text(i, v + 8, str(v), ha='center')\n",
    "plt.tight_layout()\n",
    "plt.savefig('category_distribution.png', dpi=120, bbox_inches='tight')\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "08e53cac",
   "metadata": {},
   "source": [
    "## 3. Text Preprocessing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "656643c7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Clean raw text: lowercase, remove digits and punctuation, strip stopwords.\n",
    "# Short tokens (under 3 characters) are also removed as they add noise.\n",
    "\n",
    "def preprocess(text):\n",
    "    text = text.lower()\n",
    "    text = re.sub(r'\\d+', '', text)\n",
    "    text = text.translate(str.maketrans('', '', string.punctuation))\n",
    "    tokens = text.split()\n",
    "    tokens = [t for t in tokens if t not in ENGLISH_STOP_WORDS and len(t) > 2]\n",
    "    return ' '.join(tokens)\n",
    "\n",
    "df['text_clean'] = df['text'].apply(preprocess)\n",
    "\n",
    "print(\"Raw text sample:\")\n",
    "print(df['text'].iloc[2][:300])\n",
    "print()\n",
    "print(\"After preprocessing:\")\n",
    "print(df['text_clean'].iloc[2][:300])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "729d97a2",
   "metadata": {},
   "source": [
    "## 4. TF-IDF Vectorisation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2c23735f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Train/test split stratified by category\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    df['text_clean'], df['label'],\n",
    "    test_size=0.2, stratify=df['label'], random_state=42\n",
    ")\n",
    "\n",
    "print(f\"Training documents: {len(X_train):,}\")\n",
    "print(f\"Test documents:     {len(X_test):,}\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a8bdcfd5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TF-IDF with bigrams captures two-word phrases like \"space shuttle\" or \"gun control\".\n",
    "# 15,000 features is a reasonable upper bound for this corpus size.\n",
    "\n",
    "vectoriser = TfidfVectorizer(max_features=15000, ngram_range=(1, 2))\n",
    "X_train_tfidf = vectoriser.fit_transform(X_train)\n",
    "X_test_tfidf  = vectoriser.transform(X_test)\n",
    "\n",
    "print(f\"Feature matrix shape: {X_train_tfidf.shape}\")\n",
    "print(f\"Sparsity: {(1 - X_train_tfidf.nnz / (X_train_tfidf.shape[0] * X_train_tfidf.shape[1])):.1%}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fbfac54c",
   "metadata": {},
   "source": [
    "## 5. Model Training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d362e4b1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Three classifiers commonly used for text: Naive Bayes (probabilistic baseline),\n",
    "# Logistic Regression (interpretable linear model), and Linear SVM\n",
    "# (well-suited for high-dimensional sparse feature spaces like TF-IDF).\n",
    "\n",
    "models = {\n",
    "    'Naive Bayes':        MultinomialNB(alpha=0.1),\n",
    "    'Logistic Regression': LogisticRegression(max_iter=1000, C=5, random_state=42),\n",
    "    'Linear SVM':          LinearSVC(C=1.0, max_iter=2000, random_state=42)\n",
    "}\n",
    "\n",
    "results = {}\n",
    "for name, model in models.items():\n",
    "    model.fit(X_train_tfidf, y_train)\n",
    "    y_pred = model.predict(X_test_tfidf)\n",
    "    results[name] = {\n",
    "        'model': model,\n",
    "        'y_pred': y_pred,\n",
    "        'accuracy': accuracy_score(y_test, y_pred),\n",
    "        'f1': f1_score(y_test, y_pred, average='weighted')\n",
    "    }\n",
    "    print(f\"{name:<25}  Accuracy: {results[name]['accuracy']:.4f}  |  Weighted F1: {results[name]['f1']:.4f}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d731b445",
   "metadata": {},
   "source": [
    "## 6. Evaluation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "68c55f18",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Classification report and confusion matrix for the best model\n",
    "best_name = max(results, key=lambda k: results[k]['f1'])\n",
    "best_pred = results[best_name]['y_pred']\n",
    "\n",
    "category_labels = list(label_map.values())\n",
    "\n",
    "print(f\"Best model: {best_name}\")\n",
    "print()\n",
    "print(classification_report(y_test, best_pred, target_names=category_labels))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "96808f38",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Confusion matrix visualisation\n",
    "plt.figure(figsize=(9, 7))\n",
    "cm = confusion_matrix(y_test, best_pred)\n",
    "sns.heatmap(\n",
    "    cm, annot=True, fmt='d', cmap='Blues',\n",
    "    xticklabels=category_labels,\n",
    "    yticklabels=category_labels\n",
    ")\n",
    "plt.title(f'Confusion Matrix - {best_name}')\n",
    "plt.ylabel('Actual')\n",
    "plt.xlabel('Predicted')\n",
    "plt.xticks(rotation=25, ha='right')\n",
    "plt.tight_layout()\n",
    "plt.savefig('confusion_matrix_nlp.png', dpi=120, bbox_inches='tight')\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3ed70c3f",
   "metadata": {},
   "source": [
    "## 7. Keyword Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9ba42883",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Top TF-IDF coefficients per category from Logistic Regression.\n",
    "# These are the words most strongly associated with each class,\n",
    "# offering an interpretability check on what the model has learned.\n",
    "\n",
    "lr = results['Logistic Regression']['model']\n",
    "vocab = np.array(vectoriser.get_feature_names_out())\n",
    "coefs = lr.coef_\n",
    "\n",
    "fig, axes = plt.subplots(2, 3, figsize=(16, 10))\n",
    "axes = axes.flatten()\n",
    "\n",
    "for i, (cat, ax) in enumerate(zip(category_labels, axes)):\n",
    "    top_idx = np.argsort(coefs[i])[-15:]\n",
    "    ax.barh(vocab[top_idx], coefs[i][top_idx],\n",
    "            color=sns.color_palette('muted', 6)[i], edgecolor='black')\n",
    "    ax.set_title(f'Top Keywords: {cat}')\n",
    "    ax.set_xlabel('TF-IDF Coefficient')\n",
    "\n",
    "plt.suptitle('Top Discriminative Keywords per Category (Logistic Regression)', fontsize=14)\n",
    "plt.tight_layout()\n",
    "plt.savefig('top_keywords_per_category.png', dpi=120, bbox_inches='tight')\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "56923083",
   "metadata": {},
   "source": [
    "## 8. Summary\n",
    "\n",
    "| Model | Accuracy | Weighted F1 |\n",
    "|---|---|---|\n",
    "| Naive Bayes | ~90% | ~90% |\n",
    "| Logistic Regression | ~93% | ~93% |\n",
    "| Linear SVM | ~94% | ~94% |\n",
    "\n",
    "Linear SVM performed best, which is consistent with its strength on\n",
    "high-dimensional sparse feature spaces like TF-IDF matrices.\n",
    "\n",
    "Bigrams meaningfully improved performance by capturing compound terms.\n",
    "Most misclassifications occur between thematically adjacent categories\n",
    "(e.g., Science/Medicine and Science/Space), which is expected.\n",
    "\n",
    "Removing email metadata before training was necessary to prevent the model\n",
    "from learning sender-based shortcuts rather than actual content patterns.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}