Natural Language Processing · 2024 · github.com/kmexa

NLP Text Classification: SMS Spam Detection

Complete text classification pipeline from raw SMS text to a deployable classifier. Covers text cleaning, tokenisation, stemming, TF-IDF feature extraction with bigrams, model comparison, and interpretability through discriminative word analysis.


Methodology

01 Text normalisation: URLs replaced with "url", phone numbers with "phone", currency symbols with "money"; this reduces vocabulary noise
02 Stemming: PorterStemmer collapses inflected forms (winning/wins to win) to reduce feature-space dimensionality
03 TF-IDF vectorisation with unigrams and bigrams, capped at 5,000 features; bigrams capture multi-word spam patterns such as "call now" and "win cash"
04 Three-model comparison: Multinomial Naive Bayes, Logistic Regression, and Linear SVM, trained and evaluated under identical conditions
05 LinearSVC selected: it finds a maximum-margin hyperplane in TF-IDF space and outperforms both Naive Bayes and Logistic Regression on this task
06 Interpretability: feature coefficients extracted to reveal the 20 most predictive spam and ham words
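
The normalisation step (01) can be sketched with the standard library alone. This is a minimal illustration, not the notebook's code: the regular expressions, the `normalise` function name, and the decision to drop remaining digits and punctuation are all assumptions made for the example.

```python
import re

# Hypothetical patterns; the notebook's actual regexes are not shown on this page.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
PHONE_RE = re.compile(r"\+?\d[\d\s\-]{7,}\d")   # loose match for long digit runs
MONEY_RE = re.compile(r"[£$€]")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")


def normalise(text: str) -> str:
    """Lower-case an SMS and swap noisy tokens for placeholder words."""
    text = text.lower()
    text = URL_RE.sub(" url ", text)      # URLs first, so their digits are not seen as phones
    text = PHONE_RE.sub(" phone ", text)
    text = MONEY_RE.sub(" money ", text)
    text = NON_ALPHA_RE.sub(" ", text)    # assumption: drop leftover digits and punctuation
    return " ".join(text.split())         # collapse repeated whitespace


print(normalise("WIN £1000 cash! Call 0800 123 4567 or visit http://example.com/win NOW"))
# prints: win money cash call phone or visit url now
```

The cleaned string would then be tokenised and stemmed before vectorisation; the order of the substitutions matters, since replacing URLs first keeps digits inside links from being mistaken for phone numbers.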

Results

Accuracy: 98.9%
CV F1: 0.981
Dataset: 5,572 SMS

Model                    Accuracy   5-Fold CV F1
Naive Bayes              97.9%      0.942
Logistic Regression      98.6%      0.967
Linear SVM (selected)    98.9%      0.981

TF-IDF · LinearSVC · NLTK · PorterStemmer · scikit-learn · bigrams · Python
