Skip to main content

Phase 8 📊 — Data Science

Topics: pandas Series, DataFrames, CSV handling, basic data analysis

Transform raw data into insights. pandas is the most popular Python library for data analysis — you'll use it to load, clean, explore, and present data.


🔄 Exercise Flow


📚 Prerequisites

Before starting these exercises, make sure you've read:


🔰 Starter: Series Operations Lab

Time: 12 minutes

Work with pandas Series to perform element-wise operations, filtering, and statistical calculations.

Learning Objectives

  • Create a Series from a list and a dictionary
  • Perform vectorised arithmetic operations
  • Use boolean indexing to filter values
  • Compute descriptive statistics (.mean(), .std(), .min(), .max())
  • Handle missing values with .isna() and .fillna()

Starter Code

# Starter: Series Operations Lab
import pandas as pd

# Daily temperatures (°C) for a week
temps = pd.Series([28.5, 30.1, 29.8, None, 31.2, 27.9, 26.5],
index=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])

# TODO 1: Print the Series
# TODO 2: How many days have data? (count non-null values)
# TODO 3: What is the average temperature for the week?
# TODO 4: What is the highest and lowest temperature?
# TODO 5: Fill the missing value with the average temperature
# TODO 6: Convert all temps to Fahrenheit using vectorised operations (F = C * 9/5 + 32)
# TODO 7: Find all days where temp > 30°C (after filling missing)
# TODO 8: Add 5 bonus questions of your own using Series methods

# Your code here 👇

Expected Output

Mon 28.5
Tue 30.1
Wed 29.8
Thu NaN
Fri 31.2
Sat 27.9
Sun 26.5
dtype: float64

=== ANALYSIS ===
Non-null count: 6 / 7
Average temp: 29.0°C
Max temp: 31.2°C (Fri)
Min temp: 26.5°C (Sun)
Filled NaN: 29.0°C

Days > 30°C:
Tue 30.1
Fri 31.2

Temps in Fahrenheit:
Mon 83.30
Tue 86.18
Wed 85.64
...

⭐ Medium: Sales Data Analysis with DataFrames

Time: 22 minutes

Load a sales dataset into a DataFrame, perform cleaning, grouping, aggregation, and visualisation-style analysis (print-based).

Learning Objectives

  • Create a DataFrame from a dictionary
  • Use .head(), .info(), .describe() for exploration
  • Filter rows with boolean conditions
  • Group data with .groupby() and aggregate with .agg()
  • Add new calculated columns
  • Sort and rank data

Starter Code

# Medium: Sales Data Analysis with DataFrames
import pandas as pd

# Sample sales data
sales_data = {
"Date": [
"2025-01-05", "2025-01-05", "2025-01-06", "2025-01-06",
"2025-01-07", "2025-01-07", "2025-01-08", "2025-01-08"
],
"Product": [
"Laptop", "Mouse", "Laptop", "Keyboard",
"Monitor", "Mouse", "Laptop", "Keyboard"
],
"Category": [
"Electronics", "Accessories", "Electronics", "Accessories",
"Electronics", "Accessories", "Electronics", "Accessories"
],
"Units": [3, 15, 2, 10, 5, 20, 4, 8],
"Unit_Price": [1200, 25, 1200, 80, 300, 25, 1200, 80],
}

df = pd.DataFrame(sales_data)

# TODO 1: Add a 'Revenue' column (Units * Unit_Price)
# TODO 2: Show basic info and summary statistics
# TODO 3: Filter to show only Electronics with revenue > $1000
# TODO 4: Calculate total revenue per product
# TODO 5: Calculate total revenue per category
# TODO 6: Find the best-selling product (most units sold)
# TODO 7: Add a 'Profit_Margin' column: Electronics = 30%, Accessories = 50%
# TODO 8: Sort products by total revenue descending
# TODO 9: Create a summary report

# Your code here 👇

Expected Output

=== SALES DATA OVERVIEW ===
Shape: (8, 5)
Columns: Date, Product, Category, Units, Unit_Price

=== SUMMARY STATISTICS ===
Units Unit_Price Revenue
count 8.00 8.00 8.00
mean 8.38 415.00 4950.00
std 6.46 497.71 5575.35
min 2.00 25.00 375.00
max 20.00 1200.00 15000.00

=== REVENUE BY PRODUCT ===
Laptop $10,800.00
Mouse $875.00
Keyboard $1,440.00
Monitor $1,500.00

=== REVENUE BY CATEGORY ===
Electronics $14,100.00
Accessories $2,315.00

=== BEST SELLERS ===
Most Units: Mouse (35 units)
Top Revenue: Laptop ($10,800)

🏆 Hard: Complete Data Analysis Pipeline

Time: 35 minutes

Build a complete data analysis pipeline that generates sample data, loads it into pandas, cleans it, performs multi-dimensional analysis, and exports results.

Learning Objectives

  • Generate realistic sample data using pandas and random data
  • Perform data cleaning (handle missing values, remove duplicates, fix types)
  • Use multi-level grouping and aggregation
  • Create derived metrics and calculated columns
  • Implement conditional logic with np.where() or .apply()
  • Export analysis results to CSV
  • Build a reusable analysis pipeline function

Starter Code

# Hard: Complete Data Analysis Pipeline
import pandas as pd
import numpy as np
import random
from datetime import datetime, timedelta

# --- Generate realistic sales data ---
def generate_sales_data(n=200):
"""Generate n rows of simulated sales data."""
products = {
"Laptop": {"category": "Electronics", "price": 1200, "cost": 800},
"Phone": {"category": "Electronics", "price": 800, "cost": 550},
"Tablet": {"category": "Electronics", "price": 400, "cost": 280},
"Mouse": {"category": "Accessories", "price": 25, "cost": 10},
"Keyboard": {"category": "Accessories", "price": 80, "cost": 35},
"Monitor": {"category": "Electronics", "price": 300, "cost": 200},
"Headphones": {"category": "Accessories", "price": 150, "cost": 70},
"USB Hub": {"category": "Accessories", "price": 35, "cost": 15},
}
cities = ["Mumbai", "Delhi", "Bengaluru", "Chennai", "Pune"]
records = []
base_date = datetime(2025, 1, 1)

for _ in range(n):
product_name = random.choice(list(products.keys()))
product_info = products[product_name]
date = base_date + timedelta(days=random.randint(0, 90))
units = random.randint(1, 10)
records.append({
"Date": date.strftime("%Y-%m-%d"),
"Product": product_name,
"Category": product_info["category"],
"Price": product_info["price"],
"Cost": product_info["cost"],
"Units": units,
"City": random.choice(cities),
"Rating": round(random.uniform(1, 5), 1),
})

df = pd.DataFrame(records)
# Introduce some missing values and duplicates
df.loc[random.sample(range(n), 5), "Rating"] = None
df = pd.concat([df, df.iloc[:3]]) # Add 3 duplicates
return df

# --- Load data ---
df = generate_sales_data(200)
print(f"Raw data: {df.shape[0]} rows, {df.shape[1]} columns")

# TODO 1: Data Cleaning
# - Remove duplicate rows (keep first)
# - Fill missing ratings with the average rating
# - Add a 'Revenue' column (Units * Price)
# - Add a 'Profit' column (Revenue - Units * Cost)
# - Add a 'Profit_Margin_%' column

# TODO 2: Monthly Analysis
# - Extract month from Date into a new column
# - Calculate total revenue and profit per month

# TODO 3: City Analysis
# - Which city generates the most revenue?
# - Which city has the highest average rating?

# TODO 4: Product Performance
# - Top 3 products by revenue
# - Bottom 3 products by profit margin

# TODO 5: Category Summary
# - Total revenue, profit, and avg margin per category

# TODO 6: Advanced Analysis
# - Find products where profit margin > 50% AND units sold > 50
# - Calculate the correlation between Rating and Revenue
# - Find the best-selling product in each city

# TODO 7: Export
# - Save the cleaned dataset to "cleaned_sales.csv"
# - Save the monthly summary to "monthly_summary.csv"
# - Save the city analysis to "city_analysis.csv"

# TODO 8: Wrap everything in a function: run_analysis(n_records=200)

# Your code here 👇

Expected Output

═══ DATA ANALYSIS PIPELINE ═══

[1/8] Generating data... 200 rows generated.

[2/8] Cleaning...
Duplicates removed: 3
Missing ratings filled: 5
Shape: 197 × 11

[3/8] Monthly Revenue & Profit:
Month Revenue Profit Margin
Jan $28,450.00 $10,230.00 35.9%
Feb $32,100.00 $11,890.00 37.0%
Mar $35,670.00 $13,450.00 37.7%

[4/8] City Performance:
Mumbai $24,500.00 ⭐ 4.2 avg rating
Delhi $22,800.00 ⭐ 4.0 avg rating
Bengaluru $21,200.00 ⭐ 4.5 avg rating ← Highest rated!
Chennai $18,900.00 ⭐ 3.8 avg rating
Pune $15,820.00 ⭐ 4.1 avg rating

[5/8] Top 3 Products by Revenue:
1. Laptop — $38,400.00
2. Phone — $22,400.00
3. Monitor — $10,200.00

[6/8] Categories:
Electronics Revenue: $72,100.00 Margin: 34.2%
Accessories Revenue: $31,870.00 Margin: 58.5%

[7/8] Best-Selling Product by City:
Mumbai: Laptop (25 units)
Delhi: Phone (20 units)
Bengaluru: Laptop (18 units)

[8/8] Exported:
✓ cleaned_sales.csv
✓ monthly_summary.csv
✓ city_analysis.csv

═══ PIPELINE COMPLETE ═══

💡 Tips for Success

  • Always check df.info() and df.describe() first — they tell you about data types, missing values, and basic statistics instantly.
  • Use axis=0 for rows, axis=1 for columns — common confusion when using .drop() and .apply().
  • Vectorised operations are fast — avoid looping over DataFrames. Use .apply() or built-in methods instead.
  • loc vs iloc: df.loc["label"] uses index labels; df.iloc[0] uses integer position.
  • Chain methods for readability: df.dropna().groupby("City")["Revenue"].sum().sort_values(ascending=False).
  • Install pandas first: pip install pandas.

← Back to Exercises