Week 6: Statistical Foundations and Probability | Python Data Science Tutorials

Learning objectives: By the end of this week you will be able to apply statistical foundations and probability concepts to real datasets, write executable Python code for each technique, and complete both graded assignments independently.

Session 1: Probability and Bayes Theorem

Probability measures likelihood on a scale from 0 (impossible) to 1 (certain). The conditional probability of A given B is P(A|B) = P(A and B) / P(B), defined when P(B) > 0. Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E). In words: posterior = likelihood * prior / marginal evidence. This underpins Bayesian inference, Naive Bayes classifiers, and medical diagnostic testing. A striking result: a test with 94% sensitivity and 97% specificity has a positive predictive value of only about 39.0% when disease prevalence is 2%, because false positives from the large healthy group outnumber true positives from the small diseased group.

from scipy import stats
import numpy as np

# Bayes theorem: medical screening
# Disease prevalence: 2%, Test sensitivity: 94%, Specificity: 97%
prevalence  = 0.02
sensitivity = 0.94
specificity = 0.97
fpr = 1 - specificity  # false positive rate

p_positive = sensitivity * prevalence + fpr * (1 - prevalence)
p_disease_given_positive = (sensitivity * prevalence) / p_positive

print(f'P(Disease | Positive Test): {p_disease_given_positive:.4f} ({p_disease_given_positive*100:.1f}%)')
# Only about 39.0%: 0.0188 / 0.0482. Most positives are false positives in low-prevalence screening (the base-rate effect)
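
To see how strongly the result depends on prevalence, the same calculation can be wrapped in a function and evaluated at several prevalence values. This is a minimal sketch: the function name positive_predictive_value and the prevalence values other than 2% are illustrative assumptions, not part of the course material.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(disease | positive test) from Bayes' theorem."""
    true_pos = sensitivity * prevalence               # P(positive and diseased)
    false_pos = (1 - specificity) * (1 - prevalence)  # P(positive and healthy)
    return true_pos / (true_pos + false_pos)

# Same test as above (94% sensitivity, 97% specificity);
# prevalence values other than 0.02 are assumed for illustration
prevalences = [0.001, 0.02, 0.10, 0.50]
ppv_by_prevalence = {
    p: positive_predictive_value(p, sensitivity=0.94, specificity=0.97)
    for p in prevalences
}

for p, ppv in ppv_by_prevalence.items():
    print(f'prevalence {p:6.1%} -> PPV {ppv:6.1%}')
```

The test itself never changes here; only the prior does. That is why the same test can be nearly useless for mass screening yet informative in a high-risk group.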

Session 2: Probability Distributions

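Binomial(n, p) models the count of successes in n independent trials, each with success probability p. Mean = n*p, variance = n*p*(1-p). The Normal distribution is characterised by its mean (mu) and standard deviation (sigma). The 68-95-99.7 rule: approximately 68% of values lie within 1 sigma of the mean, 95% within 2 sigma, and 99.7% within 3 sigma. The Central Limit Theorem: the distribution of sample means of independent draws approaches a Normal distribution as n increases, whatever the shape of the population distribution, provided it has finite variance. The standard deviation of the sample mean is sigma/sqrt(n).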

from scipy import stats
import numpy as np

# Binomial - loan default portfolio
binom_dist = stats.binom(n=500, p=0.03)
print(f'Expected defaults:        {binom_dist.mean():.1f}')
print(f'Std deviation:            {binom_dist.std():.2f}')
print(f'P(more than 25 defaults): {binom_dist.sf(25):.4f}')  # sf = 1 - cdf, more accurate in the tail
print(f'95th percentile:          {binom_dist.ppf(0.95):.0f} defaults')

# Normal - credit score
norm_dist = stats.norm(loc=680, scale=75)
p_prime = 1 - norm_dist.cdf(750)
print(f'P(score > 750): {p_prime:.4f} ({p_prime*100:.1f}%)')

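# CLT demonstration: means of 40 draws from a skewed (exponential) population
np.random.seed(42)  # fixed seed so the run is reproducible
population = stats.expon(scale=1000).rvs(10000)
sample_means = [np.mean(np.random.choice(population, 40)) for _ in range(2000)]
print(f'Mean of sample means: {np.mean(sample_means):.2f}')  # close to the population mean of about 1000
print(f'Std of sample means:  {np.std(sample_means, ddof=1):.2f}')  # close to 1000/sqrt(40), about 158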
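
The formulas quoted in this session can be checked numerically. The sketch below compares the binomial mean and variance formulas against scipy, recomputes the 68-95-99.7 percentages from the Normal CDF, and compares the spread of simulated sample means with sigma/sqrt(n). The seed and the variable names are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

# Binomial moments: closed-form formulas vs scipy
n_loans, p_default = 500, 0.03
defaults = stats.binom(n=n_loans, p=p_default)
mean_formula = n_loans * p_default
var_formula = n_loans * p_default * (1 - p_default)
print(f'mean: formula {mean_formula:.2f}, scipy {defaults.mean():.2f}')
print(f'var:  formula {var_formula:.2f}, scipy {defaults.var():.2f}')

# 68-95-99.7 rule: probability mass within k sigma of the mean
std_normal = stats.norm()
coverage = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}
for k, c in coverage.items():
    print(f'within {k} sigma: {c:.4f}')

# CLT: the std of sample means should be near sigma / sqrt(n)
rng = np.random.default_rng(0)  # seed is an arbitrary choice
sample_size = 40
samples = rng.exponential(scale=1000, size=(2000, sample_size))
clt_means = samples.mean(axis=1)
theoretical_se = 1000 / np.sqrt(sample_size)  # exponential: sigma equals the scale
print(f'std of sample means: {clt_means.std(ddof=1):.1f} (theory {theoretical_se:.1f})')
```

Drawing all 2,000 samples as one 2-D array and averaging along axis 1 avoids a Python loop; it is equivalent in spirit to the list comprehension used in the CLT demonstration above.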

Session 3: Confidence Intervals

Correct interpretation of a 95% confidence interval: if we repeated the sampling and the CI construction 100 times, approximately 95 of the 100 intervals would contain the true parameter. The interval is the random quantity; the parameter is fixed. When the population standard deviation is unknown (the usual case), use a t-interval with n-1 degrees of freedom via scipy.stats.t.interval(). For a given sample standard deviation, the width shrinks in proportion to 1/sqrt(n), so quadrupling the sample size roughly halves the width.

import numpy as np
from scipy import stats

np.random.seed(7)
customer_spend = np.random.normal(loc=45000, scale=12000, size=60)

xbar = np.mean(customer_spend)
s    = np.std(customer_spend, ddof=1)
n    = len(customer_spend)

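# 95% t-interval
# Confidence level passed positionally: the keyword is `confidence` in recent SciPy and `alpha` in older releases
ci = stats.t.interval(0.95, df=n-1, loc=xbar, scale=s/np.sqrt(n))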
print(f'Sample mean:     NGN {xbar:,.2f}')
print(f'95% CI:          NGN {ci[0]:,.2f} to NGN {ci[1]:,.2f}')
print(f'Margin of error: NGN {(ci[1]-ci[0])/2:,.2f}')
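
The repeated-sampling interpretation and the 1/sqrt(n) behaviour can both be checked by simulation. The sketch below reuses the population settings from the example above (mean 45000, standard deviation 12000, n = 60); the number of repetitions, the seed, and the helper half_width are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # seed is an arbitrary choice
true_mean, true_sd, n, n_repeats = 45000, 12000, 60, 2000

# Draw n_repeats independent samples and build a 95% t-interval from each
samples = rng.normal(true_mean, true_sd, size=(n_repeats, n))
xbars = samples.mean(axis=1)
sems = samples.std(axis=1, ddof=1) / np.sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)
lower = xbars - t_crit * sems
upper = xbars + t_crit * sems

# Fraction of intervals that contain the fixed true mean
coverage = np.mean((lower <= true_mean) & (true_mean <= upper))
print(f'Empirical coverage: {coverage:.3f}')  # expected to be close to 0.95

def half_width(n, s=12000, confidence=0.95):
    """Margin of error of a t-interval for sample size n and sample std s."""
    t = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)
    return t * s / np.sqrt(n)

# Quadrupling n roughly halves the margin of error
ratio = half_width(60) / half_width(240)
print(f'half_width(60) / half_width(240) = {ratio:.3f}')
```

The ratio comes out slightly above 2 rather than exactly 2, because the t critical value also shrinks a little as the degrees of freedom grow.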

Week 6 Assignments

Submit your completed notebooks to your GitHub repository before the next session. Feedback is provided within 48 hours.

Assignment 1: Compute and interpret 90%, 95%, and 99% CIs for the mean of a numerical column in a real dataset. Explain what each interval means in context and how the width changes with the confidence level.

Assignment 2: Demonstrate the CLT with a simulation: draw 5,000 samples of size 40 from a skewed distribution. Plot the distribution of the sample means and overlay the theoretical Normal curve.
