Topics covered: K-means (algorithm, elbow method, silhouette score), hierarchical clustering (dendrogram, Ward linkage), DBSCAN, PCA (eigenvalues, explained variance, scree plot), t-SNE visualisation
Learning objectives: By the end of this week you will be able to apply unsupervised machine learning concepts to real datasets, write executable Python code for each technique, and complete both graded assignments independently.
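K-means minimises the within-cluster sum of squares (WCSS). Algorithm: (1) initialise k centroids (k-means++ usually gives a better starting point than purely random initialisation), (2) assign each point to its nearest centroid, (3) recompute each centroid as the mean of its cluster, (4) repeat steps 2-3 until the assignments stop changing. Elbow method: plot WCSS against k and look for the 'elbow', the point after which adding clusters no longer gives a large drop in WCSS. Silhouette score (-1 to 1): for each point, s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points in the nearest other cluster; the score for a clustering is the mean of s over all points. Choose the k with the highest mean silhouette and cross-check it against the elbow plot.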
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
import numpy as np
# Fix the seed on the generator itself rather than through NumPy's global state
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=1.2, random_state=42)
X_sc = StandardScaler().fit_transform(X)
wcss, sil = [], []
for k in range(2, 11):
    # Fit K-means for this k and record both selection criteria
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    labels = km.fit_predict(X_sc)
    wcss.append(km.inertia_)  # inertia_ is the WCSS
    sil.append(silhouette_score(X_sc, labels))
best_k = range(2, 11)[np.argmax(sil)]
print(f'Best k by silhouette: {best_k} (score: {max(sil):.3f})')
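The four K-means steps listed above can also be written out directly in NumPy. The following is a minimal sketch for illustration, not a replacement for sklearn's KMeans: it uses plain random initialisation rather than k-means++, and the names lloyd_kmeans and X_demo, the tolerance, and the toy two-blob data are assumptions introduced here.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, tol=1e-6, seed=0):
    """Basic K-means loop (random initialisation, NOT k-means++)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids have effectively stopped moving
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    # Final assignment and WCSS with respect to the final centroids
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    wcss = float(np.sum((X - centroids[labels]) ** 2))
    return labels, centroids, wcss

# Toy data (hypothetical): two Gaussian blobs in 2D
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids, wcss = lloyd_kmeans(X_demo, k=2)
print(centroids, wcss)
```

Because the result depends on the starting centroids, a single run can settle in a poor local minimum; this is why the sklearn call above uses k-means++ together with n_init=10 restarts.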
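PCA finds principal components: orthogonal directions of maximum variance. The components are the eigenvectors of the covariance matrix, ordered by decreasing eigenvalue. Explained variance ratio = eigenvalue / sum(eigenvalues). A scree plot shows the eigenvalues (or explained variance ratios) in decreasing order; a cumulative curve shows how many components are needed to reach a target. A common rule of thumb is to keep enough components to explain 80-95% of total variance. Standardise the features before PCA whenever they are measured on different scales, otherwise the features with the largest variance dominate the components.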
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_wine
import numpy as np
import matplotlib.pyplot as plt
data = load_wine()
X = StandardScaler().fit_transform(data.data)
y = data.target
pca = PCA().fit(X)
# Cumulative explained variance
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_90pct = np.argmax(cumvar >= 0.90) + 1
print(f'{n_90pct} components explain >= 90% of variance')
fig, ax = plt.subplots(figsize=(8,4))
ax.plot(range(1, len(cumvar)+1), cumvar, 'bs-')
ax.axhline(0.90, color='red', linestyle='--', label='90% threshold')
ax.set_xlabel('Number of Components')
ax.set_ylabel('Cumulative Explained Variance')
ax.set_title('Cumulative Explained Variance - Wine Dataset')
ax.legend(); plt.tight_layout(); plt.show()
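The relationship between eigenvalues and explained variance ratio stated above can be checked by hand. The sketch below, under the assumption that the same standardised Wine data is used, computes the eigen-decomposition of the covariance matrix with NumPy and compares the resulting ratios with sklearn's; the variable names manual_ratio and sk_ratio are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_wine().data)

# Covariance matrix of the standardised features (columns are variables)
cov = np.cov(X, rowvar=False)

# eigh is intended for symmetric matrices and returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # reorder to decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Explained variance ratio = eigenvalue / sum(eigenvalues)
manual_ratio = eigvals / eigvals.sum()
sk_ratio = PCA().fit(X).explained_variance_ratio_

print(np.allclose(manual_ratio, sk_ratio))
```

The columns of eigvecs correspond to the rows of pca.components_ only up to sign, since an eigenvector multiplied by -1 is still an eigenvector.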
t-SNE (t-distributed Stochastic Neighbour Embedding) preserves local structure for 2D/3D visualisation. Unlike PCA, it is non-linear and learns no reusable mapping, so new points cannot be projected into an existing embedding (scikit-learn's TSNE has fit_transform but no transform). Perplexity (typically 5-50) controls the effective neighbourhood size. For high-dimensional data, reducing to about 50 dimensions with PCA first is a common way to suppress noise and speed up t-SNE. Use PCA for preprocessing/modelling; use t-SNE for visualisation only.
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
data = load_digits()
X = StandardScaler().fit_transform(data.data)
y = data.target
# PCA to 50 components first (speeds up t-SNE)
X_pca = PCA(n_components=50, random_state=42).fit_transform(X)
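# t-SNE to 2D. The iteration count is left at its default of 1000: the keyword
# is n_iter in older scikit-learn releases and max_iter in recent ones.
X_tsne = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_pca)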
fig, ax = plt.subplots(figsize=(10,8))
sc = ax.scatter(X_tsne[:,0], X_tsne[:,1], c=y, cmap='tab10', s=8, alpha=0.7)
plt.colorbar(sc, ax=ax, label='Digit (0-9)')
ax.set_title('t-SNE Visualisation of scikit-learn Digits (PCA pre-processed)')
ax.axis('off'); plt.tight_layout(); plt.show()
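The point above about new data can be seen directly in the scikit-learn API. This short sketch, assuming an illustrative 80/20 split of the digits data (the names X_train and X_new are introduced here), fits PCA on one part and projects the held-out part, then checks whether each estimator offers a transform method.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split

X, _ = load_digits(return_X_y=True)
X_train, X_new = train_test_split(X, test_size=0.2, random_state=42)

# PCA learns a linear projection that can be reused on unseen points
pca = PCA(n_components=2).fit(X_train)
X_new_2d = pca.transform(X_new)
print(X_new_2d.shape)

# TSNE only exposes fit_transform: embedding new points means refitting on all data
has_transform_pca = hasattr(pca, "transform")
has_transform_tsne = hasattr(TSNE(), "transform")
print(has_transform_pca, has_transform_tsne)
```

This is the practical reason to keep t-SNE for plots and to use PCA when the reduced features feed into a downstream model.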
Submit completed notebooks to your GitHub repository before the next session. Feedback will be returned within 48 hours.
Customer segmentation on an e-commerce dataset (RFM features). Apply K-means with k=4. Visualise the clusters with PCA and t-SNE. For each cluster, compute the mean of each feature and write a 2-sentence business description. Recommend a marketing strategy per segment.
Apply PCA to MNIST. Reconstruct images using 10, 50, 100, and 200 components. Display the reconstructions side by side and report the MSE reconstruction error for each number of components.