W3
Beginner 3 sessions · 6 hours · Python

Week 3: Data Manipulation with Pandas

Topics covered: DataFrame creation, reading CSV/Excel, exploration methods, boolean filtering, .loc[]/.iloc[], groupby aggregations, .apply(), method chaining

Learning objectives: By the end of this week you will be able to apply data manipulation with pandas concepts to real datasets, write executable Python code for each technique, and complete both graded assignments independently.

Session 1: Pandas Fundamentals and Exploration

Pandas provides two data structures: Series (labelled 1D array) and DataFrame (labelled 2D table). Read data with pd.read_csv() or pd.read_excel(). Explore with: .shape (rows, cols), .info() (dtypes), .describe() (summary statistics), .head(n) (first n rows), .isnull().sum() (missing counts), .nunique() (unique values per column).

import pandas as pd
import numpy as np

# Create a DataFrame
data = {
    'patient_id':  ['P001','P002','P003','P004','P005'],
    'age':         [45, 52, 38, 61, 29],
    'gender':      ['F', 'M', 'F', 'M', 'F'],
    'baseline_bp': [138, 145, 128, 162, 118],
    'treatment':   ['drug','placebo','drug','drug','placebo']
}
df = pd.DataFrame(data)

print(df.shape)        # (5, 5)
print(df.dtypes)       # data types
print(df.describe())   # summary statistics
print(df.head(3))

Session 2: Selecting, Filtering and .loc[]

Select a column with df['col'] (returns Series) or df[['col1','col2']] (returns DataFrame). Filter rows with boolean conditions. Combine conditions with & (and) and | (or). Use ~ for NOT. Always use .loc[row_selector, col_selector] when both selecting and assigning to avoid the SettingWithCopyWarning.

import pandas as pd

df = pd.read_csv('customers.csv')  # your dataset

# Boolean filtering
high_balance   = df[df['balance'] > 500000]
female_prime   = df[(df['gender'] == 'F') & (df['credit_score'] >= 750)]
not_defaulted  = df[~df['is_defaulted']]

print(f'High balance: {len(high_balance)}')
print(f'Female prime: {len(female_prime)}')

# .loc[] - select rows and columns simultaneously
subset = df.loc[df['age'] > 35, ['customer_id', 'age', 'balance']]

# Safe assignment with .loc[]
df.loc[df['balance'] < 0, 'balance'] = 0

Session 3: GroupBy, Apply and Method Chaining

groupby() splits data into groups, applies a function to each group, and combines results. Chain .agg() for multiple aggregations. .apply() runs a function on each row (axis=1) or column (axis=0). Method chaining writes entire pipelines without intermediate variables - wrap in parentheses for multi-line readability.

import pandas as pd

df = pd.read_csv('sales.csv')  # your dataset

# GroupBy with multiple aggregations
summary = df.groupby('region').agg(
    total_sales       = ('sale_amount', 'sum'),
    avg_sale          = ('sale_amount', 'mean'),
    transaction_count = ('id', 'count')
).reset_index().sort_values('total_sales', ascending=False)
print(summary)

# .apply() with lambda - risk tier classification
df['risk_tier'] = df['credit_score'].apply(
    lambda s: 'Low' if s >= 750 else ('Medium' if s >= 650 else 'High')
)

# Method chain
result = (
    df
    .query('date >= "2024-01-01"')
    .assign(revenue_ngn = lambda x: x['revenue_usd'] * 1450)
    .groupby('category')
    .agg(total = ('revenue_ngn', 'sum'))
    .sort_values('total', ascending=False)
    .head(10)
)

Week 3 Assignments

Submit completed notebooks to your GitHub repository before the next session. Feedback within 48 hours.

Load the Titanic dataset. Find: (1) survival rate by passenger class, (2) mean age by gender among survivors, (3) the embarkation port with the highest average fare.

Load a sales dataset. Using groupby, produce a monthly revenue summary by product category with total revenue, transaction count, and average order value.

Previous WeekNext: Week 4