Natural Language Processing | 2024 | github.com/kmexa

NLP Text Classification: SMS Spam Detection

Complete text classification pipeline from raw SMS text to a deployable classifier. Covers text cleaning, tokenisation, stemming, TF-IDF feature extraction with bigrams, model comparison, and interpretability through discriminative word analysis.


Methodology

01 Text normalisation: URLs are replaced with "url", phone numbers with "phone", and currency symbols with "money"; this reduces vocabulary noise

02 PorterStemmer collapses inflected forms (e.g. winning/wins to win) to reduce feature-space dimensionality

03 TF-IDF vectorisation with unigrams and bigrams, capped at 5,000 features; bigrams capture multi-word spam patterns such as "call now" and "win cash"

04 Three-model comparison: Multinomial Naive Bayes, Logistic Regression, and Linear SVM under identical conditions

05 LinearSVC selected: it finds a maximum-margin hyperplane in TF-IDF space and outperforms the probabilistic models on this task

06 Feature coefficients extracted to reveal the 20 most predictive spam and ham words for interpretability

Results

Accuracy: 98.9%

CV F1: 0.981

Dataset: 5,572 SMS

Model	Accuracy	5-Fold CV F1
Naive Bayes	97.9%	0.942
Logistic Regression	98.6%	0.967
Linear SVM (selected)	98.9%	0.981
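
A comparison of this kind could be run along the following lines. This is a hedged sketch under stated assumptions: the synthetic 20-message corpus, the fold seed, and `max_iter=1000` for Logistic Regression are illustrative choices, not settings taken from the notebook, so the printed scores will not match the table. Wrapping the vectoriser and the model in one pipeline refits TF-IDF inside every fold, which is one way to keep conditions identical across the three models.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Synthetic toy corpus (assumption): 10 spam and 10 ham messages.
spam = [
    f"{a} {b}"
    for a in ("win cash now", "free prize claim", "urgent call now",
              "cheap loan offer", "click url win")
    for b in ("call phone", "text stop to phone")
]
ham = [
    f"{a} {b}"
    for a in ("see you at lunch", "running late sorry", "thanks for dinner",
              "meeting moved to noon", "happy birthday mate")
    for b in ("talk later", "love you")
]
texts = spam + ham
labels = [1] * len(spam) + [0] * len(ham)  # 1 = spam, 0 = ham

models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC(),
}

# The same stratified 5-fold splits are reused for every model.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, model in models.items():
    pipe = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), max_features=5000),
        model,
    )
    scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="f1")
    results[name] = scores.mean()
    print(f"{name}: mean 5-fold F1 = {results[name]:.3f}")
```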

TF-IDF, LinearSVC, NLTK, PorterStemmer, scikit-learn, bigrams, Python

