W11
Intermediate 3 sessions · 6 hours · Python

Week 11: SQL Queries and Time Series Analysis

Topics covered: SQL fundamentals, JOINs, CTEs, window functions (ROW_NUMBER, RANK, LAG, LEAD), ARIMA model identification, seasonal decomposition, ADF test, ACF/PACF plots

Learning objectives: By the end of this week you will be able to apply sql queries and time series analysis concepts to real datasets, write executable Python code for each technique, and complete both graded assignments independently.

Session 1: SQL Fundamentals

SQL is the language of relational databases. 80% of enterprise data lives in relational databases. SELECT specifies columns, FROM the table, WHERE filters rows (before aggregation), GROUP BY groups for aggregation, HAVING filters groups (after aggregation), ORDER BY sorts. INNER JOIN returns rows with matching keys in both tables. LEFT JOIN returns all rows from the left table and matched rows from the right.

-- Monthly loan disbursement summary
SELECT
    DATE_TRUNC('month', disbursement_date) AS month,
    COUNT(*)              AS total_loans,
    SUM(loan_amount)      AS total_disbursed,
    AVG(loan_amount)      AS avg_loan,
    COUNT(*) FILTER (WHERE status = 'default') AS defaults
FROM loans
WHERE disbursement_date >= '2024-01-01'
GROUP BY 1
HAVING COUNT(*) >= 10
ORDER BY 1;

-- INNER JOIN: link loans to customer demographics
SELECT c.customer_id, c.state, l.loan_amount, l.status
FROM customers c
INNER JOIN loans l ON c.customer_id = l.customer_id
WHERE l.disbursement_date >= '2024-01-01'
AND c.state IN ('Lagos', 'Oyo', 'Abuja')
ORDER BY l.loan_amount DESC;

Session 2: CTEs and Window Functions

A CTE (WITH clause) creates a temporary named result set improving query readability. Chain multiple CTEs in one query. Window functions compute values per row based on a window of related rows without collapsing to one row per group. PARTITION BY defines the window, ORDER BY the ordering within it. Key functions: ROW_NUMBER(), RANK(), LAG(col, n) for n-rows-before values, LEAD(col, n) for n-rows-after, SUM OVER for running totals.

-- CTE: high-frequency applicants (3+ loans in 30 days)
WITH app_freq AS (
    SELECT customer_id,
           COUNT(*) AS num_apps,
           MIN(application_date) AS first_app,
           MAX(application_date) AS last_app
    FROM loan_applications
    WHERE application_date >= CURRENT_DATE - INTERVAL '90 days'
    GROUP BY customer_id
)
SELECT c.customer_id, c.name, af.num_apps
FROM customers c
JOIN app_freq af USING (customer_id)
WHERE af.num_apps >= 3
AND (af.last_app - af.first_app) <= 30
ORDER BY af.num_apps DESC;

-- Window function: running total and YoY growth
SELECT
    month, monthly_revenue,
    SUM(monthly_revenue) OVER (ORDER BY month) AS cumulative_revenue,
    AVG(monthly_revenue) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg_3m,
    ROUND(100.0*(monthly_revenue - LAG(monthly_revenue,12) OVER (ORDER BY month))
          / LAG(monthly_revenue,12) OVER (ORDER BY month), 2) AS yoy_pct
FROM monthly_revenue;

Session 3: Time Series Analysis and ARIMA

A time series decomposes into: trend (long-term direction), seasonality (regular periodic pattern), cyclical (irregular long-term fluctuations), and residual. Use seasonal_decompose() for visualisation. ARIMA(p,d,q): p = AR order (use PACF), d = differencing order (until ADF test p < 0.05), q = MA order (use ACF). Check residuals with Ljung-Box test (should be white noise).

import pandas as pd
import numpy as np
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA

np.random.seed(42)
dates = pd.date_range('2019-01-01', periods=60, freq='MS')
trend = np.linspace(100, 180, 60)
seasonality = 15 * np.sin(2 * np.pi * np.arange(60) / 12)
series = pd.Series(trend + seasonality + np.random.normal(0,5,60), index=dates)

# Decomposition
decomp = seasonal_decompose(series, model='additive', period=12)
decomp.plot(); import matplotlib.pyplot as plt; plt.tight_layout(); plt.show()

# Stationarity test
adf_stat, adf_p = adfuller(series)[:2]
print(f'ADF p-value: {adf_p:.4f} - {"Stationary" if adf_p<0.05 else "Not stationary"}')

# ARIMA(1,1,1)
model = ARIMA(series, order=(1,1,1)).fit()
forecast = model.get_forecast(12).predicted_mean
print(forecast)

Week 11 Assignments

Submit completed notebooks to your GitHub repository before the next session. Feedback within 48 hours.

Write 5 SQL queries of increasing complexity on a SQLite database: (1) basic aggregation, (2) JOIN, (3) CTE, (4) window running total, (5) LAG() YoY growth. Run all from Python using sqlite3.

Apply Box-Jenkins to any monthly time series: plot, ADF test, difference, ACF/PACF, fit ARIMA, check residuals with Ljung-Box, forecast 12 periods with 95% prediction intervals.

Previous WeekNext: Week 12