Intermediate · 3 sessions · 6 hours · Python

Week 5: Exploratory Data Analysis and Visualisation

Topics covered: Skewness and kurtosis, bivariate analysis, matplotlib figure architecture, seaborn (histplot/boxplot/violinplot/heatmap), Plotly interactive charts, publication-quality styling

Learning objectives: By the end of this week you will be able to summarise distributions with skewness and kurtosis, run bivariate analyses on real datasets, build styled matplotlib and seaborn figures, create interactive Plotly charts, and complete both graded assignments independently.

Session 1: Statistical EDA

Univariate analysis examines one variable at a time. For each numerical variable, compute the mean, median, standard deviation, skewness (positive = long right tail, negative = long left tail), and kurtosis (tail heaviness; scipy.stats.kurtosis reports excess kurtosis by default, so a normal distribution scores 0). Bivariate analysis examines pairs of variables: correlation for two numerical variables, box plots for numerical vs categorical, and cross-tabulation for two categorical variables.

import pandas as pd
import numpy as np
from scipy import stats

np.random.seed(0)
df = pd.DataFrame({
    'income':    np.random.lognormal(12.5, 0.8, 1000),
    'age':       np.random.normal(38, 10, 1000).clip(18, 70),
    'education': np.random.choice(['Primary','Secondary','Tertiary'], 1000)
})

def eda_summary(series):
    """Return key univariate statistics for a numerical Series."""
    clean = series.dropna()  # scipy functions do not skip NaN by default
    return pd.Series({
        'mean':     round(clean.mean(), 2),
        'median':   round(clean.median(), 2),
        'std':      round(clean.std(), 2),
        'skewness': round(stats.skew(clean), 3),      # > 0 means a long right tail
        'kurtosis': round(stats.kurtosis(clean), 3),  # excess kurtosis: 0 for a normal distribution
        'iqr':      round(clean.quantile(0.75) - clean.quantile(0.25), 2)
    })

print(eda_summary(df['income']))
print('Interpretation: skewness > 1 indicates strong right skew (income here is drawn from a log-normal distribution)')
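
The summary above covers only the univariate case. The three bivariate techniques named at the start of this session could be sketched roughly as follows. This is a minimal illustration, not a prescribed method: the data is a synthetic stand-in with the same columns as the example above, and the 'employed' column is a hypothetical addition included only so that there are two categorical variables to cross-tabulate.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the session's DataFrame.
# 'employed' is a hypothetical extra column, not part of the original example.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    'income':    rng.lognormal(12.5, 0.8, n),
    'age':       rng.normal(38, 10, n).clip(18, 70),
    'education': rng.choice(['Primary', 'Secondary', 'Tertiary'], n),
    'employed':  rng.choice(['Yes', 'No'], n, p=[0.7, 0.3]),
})

# Numerical vs numerical: Pearson measures linear association;
# Spearman is rank-based and less sensitive to the skew in income.
pearson = df['age'].corr(df['income'])
spearman = df[['age', 'income']].corr(method='spearman').loc['age', 'income']
print(f'Pearson r = {pearson:.3f}, Spearman rho = {spearman:.3f}')

# Numerical vs categorical: the per-group quartiles that a box plot draws.
by_education = df.groupby('education')['income'].describe()[['25%', '50%', '75%']]
print(by_education.round(0))

# Categorical vs categorical: cross-tabulation, normalised so each row sums to 1.
crosstab = pd.crosstab(df['education'], df['employed'], normalize='index')
print(crosstab.round(3))
```

Because the columns here are generated independently, the correlations should sit near zero and the group quartiles should look alike; on a real dataset these are the numbers to compare against the plots.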

Session 2: Matplotlib and Seaborn Charts

Create the Figure and Axes explicitly with fig, ax = plt.subplots() in production code instead of relying on pyplot's implicit current figure. Use at least 150 DPI for screen and 300 DPI for print. Use seaborn's 'colorblind' palette for accessibility, and remove the top and right spines to reduce clutter. Always include a descriptive title and axis labels with units. Save with plt.savefig('plot.png', dpi=300, bbox_inches='tight').

import matplotlib.pyplot as plt
import seaborn as sns

# This block reuses the DataFrame `df` created in Session 1.

plt.rcParams.update({
    'figure.dpi': 150, 'font.size': 11,
    'axes.spines.top': False, 'axes.spines.right': False
})
palette = sns.color_palette('colorblind')

fig, axes = plt.subplots(2, 2, figsize=(12, 9))
fig.suptitle('Credit Applicant EDA Dashboard', fontsize=14, fontweight='bold')

sns.histplot(df['income'], kde=True, bins=40, ax=axes[0,0], color=palette[0])
axes[0,0].set_title('Income Distribution')
axes[0,0].set_xlabel('Annual Income (NGN)')

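# seaborn >= 0.13: assign x to hue (with the legend off) to colour each box;
# passing palette without hue is deprecated.
sns.boxplot(data=df, x='education', y='income', hue='education',
            palette=palette[:3], legend=False, ax=axes[0,1])
axes[0,1].set_title('Income by Education Level')
axes[0,1].set_xlabel('Education Level')
axes[0,1].set_ylabel('Annual Income (NGN)')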

corr = df.select_dtypes('number').corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='RdBu_r', center=0, ax=axes[1,0])
axes[1,0].set_title('Correlation Matrix')

df['education'].value_counts().plot(kind='bar', rot=0, ax=axes[1,1], color=palette[:3])
axes[1,1].set_title('Education Distribution')
axes[1,1].set_xlabel('Education Level')
axes[1,1].set_ylabel('Count')

plt.tight_layout()
plt.savefig('eda_dashboard.png', dpi=300, bbox_inches='tight')
plt.show()
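
The topic list for this week includes violinplot, which the dashboard above does not use. A minimal sketch of one, following the same styling checklist, might look like the block below. The data is a synthetic stand-in for the Session 1 DataFrame, and plotting log10 of income is an assumption made here so that the long right tail does not flatten the violins; the output filename in the commented line is hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the Session 1 DataFrame (same distribution parameters).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'income':    rng.lognormal(12.5, 0.8, 1000),
    'education': rng.choice(['Primary', 'Secondary', 'Tertiary'], 1000),
})
# Assumption: show log10(income) so the right tail does not squash the violins.
df['log_income'] = np.log10(df['income'])

fig, ax = plt.subplots(figsize=(7, 4.5), dpi=150)
sns.violinplot(
    data=df, x='education', y='log_income',
    order=['Primary', 'Secondary', 'Tertiary'],
    inner='quartile',  # draw quartile lines inside each violin
    cut=0,             # do not extend the density past the observed range
    color=sns.color_palette('colorblind')[0],
    ax=ax,
)
ax.set_title('Income Distribution by Education Level')
ax.set_xlabel('Education Level')
ax.set_ylabel('log10 Annual Income (NGN)')
sns.despine(ax=ax)  # removes the top and right spines
fig.tight_layout()
# fig.savefig('income_violin.png', dpi=300, bbox_inches='tight')
```

A violin shows the full density shape that a box plot hides, so it is most useful when a group might be bimodal or heavily skewed.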

Session 3: Plotly Interactive Charts

plotly.express (import as px) creates interactive charts in one function call. Charts are JavaScript-based and render in Jupyter or as standalone HTML files. Use .show() in Jupyter, .write_html('chart.html') to save. Key functions: px.histogram(), px.box(), px.scatter(), px.bar(), px.line(). Pass hover_data to control tooltip content.

import plotly.express as px

# Interactive scatter plot
fig = px.scatter(
    df,
    x='age', y='income',
    color='education',
    hover_data=['income','age','education'],
    title='Income vs Age by Education Level',
    labels={'income': 'Annual Income (NGN)', 'age': 'Age (Years)'},
    color_discrete_sequence=px.colors.qualitative.Safe  # colour-blind-friendly palette
)
fig.update_layout(height=500, template='plotly_white')
fig.show()
# Save as interactive HTML
# fig.write_html('income_scatter.html')

Week 5 Assignments

Submit completed notebooks to your GitHub repository before the next session. Feedback is provided within 48 hours.

Assignment 1: Using a financial or health dataset from Kaggle, produce (1) a univariate analysis of all numerical variables with skewness interpretation, (2) a bivariate scatter matrix, (3) box plots by target variable, (4) a correlation heatmap, and (5) three written findings with business implications.

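Assignment 2: Create a Plotly HTML dashboard containing a distribution chart with a KDE overlay (px.histogram() has no KDE option; plotly.figure_factory.create_distplot() is one way to draw one), a grouped bar chart, and a scatter plot with colour encoding.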
