Topics covered: CRISP-DM workflow, capstone project report structure, GitHub portfolio, UK/Nigerian CV for data scientists, technical interview preparation
Learning objectives: By the end of this week you will be able to apply the CRISP-DM workflow and professional development practices to a real dataset, write executable Python code for each technique, and complete both graded assignments independently.
CRISP-DM (Cross-Industry Standard Process for Data Mining) has 6 phases: Business Understanding (why does this problem matter?), Data Understanding (what data exists and what does it reveal?), Data Preparation (clean, engineer, split), Modelling (train, tune, validate), Evaluation (does the model solve the business problem?), and Deployment (make the model available to users). The process is iterative: evaluation often sends you back to Business Understanding. A capstone report has 6 sections: Executive Summary, Data Description, Methodology, Results, Discussion, and Recommendations.
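The Data Preparation, Modelling, and Evaluation phases can be sketched in a few lines. This is a minimal illustration rather than a prescribed workflow: it assumes scikit-learn is installed, uses a synthetic dataset from make_classification in place of real data, and picks logistic regression and the F1 score purely for demonstration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for a real dataset (assumption: replace with your own data)
X, y = make_classification(n_samples=1000, n_features=8, random_state=42)

# Data Preparation: hold out a test set before anything is fitted
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Modelling: the scaler sits inside the pipeline, so each CV fold
# fits its own scaler on that fold's training portion only
model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
cv_f1 = cross_val_score(model, X_train, y_train, cv=5, scoring="f1")

# Evaluation: refit on all training data, score once on the held-out set
model.fit(X_train, y_train)
test_f1 = f1_score(y_test, model.predict(X_test))

print(f"CV F1:   {cv_f1.mean():.3f} +/- {cv_f1.std():.3f}")
print(f"Test F1: {test_f1:.3f}")
```

If the cross-validated score and the test score disagree sharply, that is usually the signal to loop back to an earlier phase rather than to keep tuning.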
# Recommended capstone project structure
# project-name/
# |
# +-- README.md # Problem, setup, results summary
# +-- data/
# | +-- raw/ # Original unmodified data
# | +-- processed/ # Cleaned and engineered
# +-- notebooks/
# | +-- 01_eda.ipynb
# | +-- 02_modelling.ipynb
# | +-- 03_evaluation.ipynb
# +-- src/
# | +-- preprocess.py
# | +-- features.py
# | +-- model.py
# +-- reports/
# | +-- final_report.pdf
# +-- requirements.txt
#
# README.md minimum content:
# - Problem statement (2 sentences)
# - Key findings (3 bullet points)
# - Model performance table (all metrics)
# - Reproducibility: 4 copy-paste commands
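The layout above can be created by hand, but a small script keeps it consistent across projects. The sketch below uses only the standard library. The function name scaffold_project is a hypothetical helper, not part of any tool, and it creates empty placeholder files only (notebooks are left out because an empty .ipynb file is not a valid notebook).

```python
import tempfile
from pathlib import Path
from typing import List

# Folder and file names mirror the recommended structure above
DIRECTORIES = ["data/raw", "data/processed", "notebooks", "src", "reports"]
FILES = [
    "README.md",
    "requirements.txt",
    "src/preprocess.py",
    "src/features.py",
    "src/model.py",
]


def scaffold_project(root: Path) -> List[Path]:
    """Create the folder layout under `root` and return every path created."""
    created = []
    for directory in DIRECTORIES:
        path = root / directory
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    for file_name in FILES:
        path = root / file_name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch(exist_ok=True)  # empty placeholder, never overwrites content
        created.append(path)
    return created


if __name__ == "__main__":
    # Demonstrate in a throwaway directory so nothing is left on disk
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "project-name"
        for path in scaffold_project(root):
            print(path.relative_to(root))
```

Git does not track empty directories, so folders such as data/processed/ will only appear in the repository once they contain a file.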
UK and Nigerian recruiters assess a GitHub profile on four things: regular commit history (signals genuine work), README quality, code quality (functions rather than one long script, docstrings, no hardcoded paths), and project breadth. Pin 6 repositories: (1) statistical analysis with a clear finding, (2) ML pipeline with documented metrics, (3) data cleaning on a messy real-world dataset, (4) SQL analysis, (5) visualisation project, (6) a tool or package. Write each README with: a one-sentence title, a 3-line description, a screenshot or chart at the top, a results table before any code, and setup instructions.
# Professional git commit workflow
# Every commit message must explain WHAT and WHY
# Bad: git commit -m 'update'
# Bad: git commit -m 'fix stuff'
# Good:
# git commit -m 'Add SMOTE oversampling - improves F1 from 0.61 to 0.78'
# git commit -m 'Impute income with median, not mean (skewness=2.3)'
# git commit -m 'Fix data leakage: move StandardScaler inside Pipeline'
# .gitignore for data science projects
# .ipynb_checkpoints/
# __pycache__/
# *.pyc
# data/raw/*.csv
# .env
# models/*.pkl
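The median-imputation commit message above rests on a check that is worth being able to reproduce. The sketch below uses only the standard library. The income figures are made up for illustration, and skewness is a hypothetical helper implementing the simple moment-based (population) formula, which may differ slightly from the bias-corrected estimators some libraries report.

```python
import statistics


def skewness(values):
    """Moment-based skewness: mean cubed deviation over population std cubed."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * std ** 3)


# Made-up incomes with a long right tail; None marks a missing value
incomes = [21_000, 24_000, 25_000, 27_000, None,
           30_000, 32_000, None, 35_000, 250_000]
observed = [v for v in incomes if v is not None]

mean_income = statistics.fmean(observed)
median_income = statistics.median(observed)
print(f"skewness={skewness(observed):.2f}")
print(f"mean={mean_income:.0f}, median={median_income:.0f}")

# One extreme value drags the mean far above a typical income,
# so the median is the safer fill value for the missing entries
imputed = [median_income if v is None else v for v in incomes]
```

Quoting the skewness in the commit message, as in the example above, records why the decision was made and not just what changed.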
UK and Nigerian data science interviews typically run in three stages: (1) a take-home assignment with a 24-72 hour deadline, (2) a 30-60 minute technical interview (statistics, ML concepts, SQL, live coding), and (3) a final panel. The primary skill tested is framing a problem clearly before solving it. Statistics questions: Type I vs Type II errors, when to use a t-test vs Mann-Whitney, what multicollinearity is. ML questions: how Random Forest reduces overfitting, explain gradient boosting in plain language, the precision vs recall tradeoff. SQL questions: second-highest salary, a window function for a running total, WHERE vs HAVING.
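The three SQL questions listed above can be practised without a database server by using Python's built-in sqlite3 module. This is one possible set of answers, not the only accepted ones. The employees table and its rows are invented for the example, and the window function assumes the bundled SQLite is version 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('Ada', 'Data', 90000), ('Bola', 'Data', 75000),
        ('Chidi', 'Data', 90000), ('Dami', 'Ops', 60000),
        ('Emeka', 'Ops', 52000);
""")

# Second-highest salary: the largest value strictly below the maximum,
# so a tie at the top (Ada and Chidi) does not hide the answer
second_highest = conn.execute("""
    SELECT MAX(salary) FROM employees
    WHERE salary < (SELECT MAX(salary) FROM employees)
""").fetchone()[0]

# Running total with a window function, ordered by salary then name
running_total = conn.execute("""
    SELECT name, salary,
           SUM(salary) OVER (
               ORDER BY salary, name
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS running
    FROM employees
    ORDER BY salary, name
""").fetchall()

# WHERE filters rows before grouping; HAVING filters groups afterwards
dept_avg = conn.execute("""
    SELECT dept, AVG(salary) AS avg_salary
    FROM employees
    WHERE salary > 55000          -- row-level filter
    GROUP BY dept
    HAVING AVG(salary) > 70000    -- group-level filter
""").fetchall()

print(second_highest)
for row in running_total:
    print(row)
print(dept_avg)
conn.close()
```

In an interview it is worth saying out loud how ties are handled in the second-highest query, since that is the usual follow-up question.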
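# Five high-frequency Python interview questions
# Assumes `df` is an existing pandas DataFrame with 'value' and 'category' columns
import pandas as pd
# 1. Find duplicate rows in a DataFrame (keep=False flags every copy, not only the repeats)
duplicates = df[df.duplicated(keep=False)]
# 2. Compute a 30-day rolling mean on a time series (window=30 counts rows, so this assumes one row per day)
df['rolling_30d'] = df['value'].rolling(window=30, min_periods=1).mean()
# 3. One-hot encode without sklearn (drop_first=True drops one level to avoid collinearity)
encoded = pd.get_dummies(df['category'], prefix='cat', drop_first=True)
# 4. Flatten a nested list
nested = [[1, 2], [3, 4], [5]]
flat = [x for sublist in nested for x in sublist]  # [1, 2, 3, 4, 5]
# 5. Percentage of NaN values per column, highest first
nan_pct = (df.isna().mean() * 100).sort_values(ascending=False)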
Push your completed notebooks to your GitHub repository before the next session. Expect feedback within 48 hours.
Build a complete end-to-end capstone project on a real dataset (minimum 5,000 rows and 8 features). Deliverables: cleaned data, at least 3 ML models compared with cross-validation, hyperparameter tuning, SHAP analysis, a 6-section written report, and a public GitHub repository.
Answer 5 mock interview questions in writing as if speaking to an interviewer: (1) explain bias-variance tradeoff with a project example, (2) 15% missing values in a column - walk through your decision process, (3) 98% training accuracy, 71% test accuracy - what is wrong and 4 fixes, (4) SQL: customers who purchased in Jan 2024 but not Feb 2024, (5) explain p-value to a marketing manager.