Phase 8 📊 — Data Science
Topics: pandas Series, DataFrames, CSV handling, basic data analysis
Transform raw data into insights. pandas is the most popular Python library for data analysis — you'll use it to load, clean, explore, and present data.
🔄 Exercise Flow
📚 Prerequisites
Before starting these exercises, make sure you've read:
- Introduction to pandas — What pandas is and how to install it
- pandas Series — 1D labelled arrays
- DataFrame Basics — Creating and inspecting DataFrames
- Indexing & Selection —
loc,iloc, and boolean filtering - CSV with pandas — Reading and writing CSV files
🔰 Starter: Series Operations Lab
Time: 12 minutes
Work with pandas Series to perform element-wise operations, filtering, and statistical calculations.
Learning Objectives
- Create a Series from a list and a dictionary
- Perform vectorised arithmetic operations
- Use boolean indexing to filter values
- Compute descriptive statistics (
.mean(),.std(),.min(),.max()) - Handle missing values with
.isna()and.fillna()
Starter Code
# Starter: Series Operations Lab
import pandas as pd
# Daily temperatures (°C) for a week
temps = pd.Series([28.5, 30.1, 29.8, None, 31.2, 27.9, 26.5],
index=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])
# TODO 1: Print the Series
# TODO 2: How many days have data? (count non-null values)
# TODO 3: What is the average temperature for the week?
# TODO 4: What is the highest and lowest temperature?
# TODO 5: Fill the missing value with the average temperature
# TODO 6: Convert all temps to Fahrenheit using vectorised operations (F = C * 9/5 + 32)
# TODO 7: Find all days where temp > 30°C (after filling missing)
# TODO 8: Add 5 bonus questions of your own using Series methods
# Your code here 👇
Expected Output
Mon 28.5
Tue 30.1
Wed 29.8
Thu NaN
Fri 31.2
Sat 27.9
Sun 26.5
dtype: float64
=== ANALYSIS ===
Non-null count: 6 / 7
Average temp: 29.0°C
Max temp: 31.2°C (Fri)
Min temp: 26.5°C (Sun)
Filled NaN: 29.0°C
Days > 30°C:
Tue 30.1
Fri 31.2
Temps in Fahrenheit:
Mon 83.30
Tue 86.18
Wed 85.64
...
⭐ Medium: Sales Data Analysis with DataFrames
Time: 22 minutes
Load a sales dataset into a DataFrame, perform cleaning, grouping, aggregation, and visualisation-style analysis (print-based).
Learning Objectives
- Create a DataFrame from a dictionary
- Use
.head(),.info(),.describe()for exploration - Filter rows with boolean conditions
- Group data with
.groupby()and aggregate with.agg() - Add new calculated columns
- Sort and rank data
Starter Code
# Medium: Sales Data Analysis with DataFrames
import pandas as pd
# Sample sales data
sales_data = {
"Date": [
"2025-01-05", "2025-01-05", "2025-01-06", "2025-01-06",
"2025-01-07", "2025-01-07", "2025-01-08", "2025-01-08"
],
"Product": [
"Laptop", "Mouse", "Laptop", "Keyboard",
"Monitor", "Mouse", "Laptop", "Keyboard"
],
"Category": [
"Electronics", "Accessories", "Electronics", "Accessories",
"Electronics", "Accessories", "Electronics", "Accessories"
],
"Units": [3, 15, 2, 10, 5, 20, 4, 8],
"Unit_Price": [1200, 25, 1200, 80, 300, 25, 1200, 80],
}
df = pd.DataFrame(sales_data)
# TODO 1: Add a 'Revenue' column (Units * Unit_Price)
# TODO 2: Show basic info and summary statistics
# TODO 3: Filter to show only Electronics with revenue > $1000
# TODO 4: Calculate total revenue per product
# TODO 5: Calculate total revenue per category
# TODO 6: Find the best-selling product (most units sold)
# TODO 7: Add a 'Profit_Margin' column: Electronics = 30%, Accessories = 50%
# TODO 8: Sort products by total revenue descending
# TODO 9: Create a summary report
# Your code here 👇
Expected Output
=== SALES DATA OVERVIEW ===
Shape: (8, 5)
Columns: Date, Product, Category, Units, Unit_Price
=== SUMMARY STATISTICS ===
Units Unit_Price Revenue
count 8.00 8.00 8.00
mean 8.38 415.00 4950.00
std 6.46 497.71 5575.35
min 2.00 25.00 375.00
max 20.00 1200.00 15000.00
=== REVENUE BY PRODUCT ===
Laptop $10,800.00
Mouse $875.00
Keyboard $1,440.00
Monitor $1,500.00
=== REVENUE BY CATEGORY ===
Electronics $14,100.00
Accessories $2,315.00
=== BEST SELLERS ===
Most Units: Mouse (35 units)
Top Revenue: Laptop ($10,800)
🏆 Hard: Complete Data Analysis Pipeline
Time: 35 minutes
Build a complete data analysis pipeline that generates sample data, loads it into pandas, cleans it, performs multi-dimensional analysis, and exports results.
Learning Objectives
- Generate realistic sample data using pandas and random data
- Perform data cleaning (handle missing values, remove duplicates, fix types)
- Use multi-level grouping and aggregation
- Create derived metrics and calculated columns
- Implement conditional logic with
np.where()or.apply() - Export analysis results to CSV
- Build a reusable analysis pipeline function
Starter Code
# Hard: Complete Data Analysis Pipeline
import pandas as pd
import numpy as np
import random
from datetime import datetime, timedelta
# --- Generate realistic sales data ---
def generate_sales_data(n=200):
"""Generate n rows of simulated sales data."""
products = {
"Laptop": {"category": "Electronics", "price": 1200, "cost": 800},
"Phone": {"category": "Electronics", "price": 800, "cost": 550},
"Tablet": {"category": "Electronics", "price": 400, "cost": 280},
"Mouse": {"category": "Accessories", "price": 25, "cost": 10},
"Keyboard": {"category": "Accessories", "price": 80, "cost": 35},
"Monitor": {"category": "Electronics", "price": 300, "cost": 200},
"Headphones": {"category": "Accessories", "price": 150, "cost": 70},
"USB Hub": {"category": "Accessories", "price": 35, "cost": 15},
}
cities = ["Mumbai", "Delhi", "Bengaluru", "Chennai", "Pune"]
records = []
base_date = datetime(2025, 1, 1)
for _ in range(n):
product_name = random.choice(list(products.keys()))
product_info = products[product_name]
date = base_date + timedelta(days=random.randint(0, 90))
units = random.randint(1, 10)
records.append({
"Date": date.strftime("%Y-%m-%d"),
"Product": product_name,
"Category": product_info["category"],
"Price": product_info["price"],
"Cost": product_info["cost"],
"Units": units,
"City": random.choice(cities),
"Rating": round(random.uniform(1, 5), 1),
})
df = pd.DataFrame(records)
# Introduce some missing values and duplicates
df.loc[random.sample(range(n), 5), "Rating"] = None
df = pd.concat([df, df.iloc[:3]]) # Add 3 duplicates
return df
# --- Load data ---
df = generate_sales_data(200)
print(f"Raw data: {df.shape[0]} rows, {df.shape[1]} columns")
# TODO 1: Data Cleaning
# - Remove duplicate rows (keep first)
# - Fill missing ratings with the average rating
# - Add a 'Revenue' column (Units * Price)
# - Add a 'Profit' column (Revenue - Units * Cost)
# - Add a 'Profit_Margin_%' column
# TODO 2: Monthly Analysis
# - Extract month from Date into a new column
# - Calculate total revenue and profit per month
# TODO 3: City Analysis
# - Which city generates the most revenue?
# - Which city has the highest average rating?
# TODO 4: Product Performance
# - Top 3 products by revenue
# - Bottom 3 products by profit margin
# TODO 5: Category Summary
# - Total revenue, profit, and avg margin per category
# TODO 6: Advanced Analysis
# - Find products where profit margin > 50% AND units sold > 50
# - Calculate the correlation between Rating and Revenue
# - Find the best-selling product in each city
# TODO 7: Export
# - Save the cleaned dataset to "cleaned_sales.csv"
# - Save the monthly summary to "monthly_summary.csv"
# - Save the city analysis to "city_analysis.csv"
# TODO 8: Wrap everything in a function: run_analysis(n_records=200)
# Your code here 👇
Expected Output
═══ DATA ANALYSIS PIPELINE ═══
[1/8] Generating data... 200 rows generated.
[2/8] Cleaning...
Duplicates removed: 3
Missing ratings filled: 5
Shape: 197 × 11
[3/8] Monthly Revenue & Profit:
Month Revenue Profit Margin
Jan $28,450.00 $10,230.00 35.9%
Feb $32,100.00 $11,890.00 37.0%
Mar $35,670.00 $13,450.00 37.7%
[4/8] City Performance:
Mumbai $24,500.00 ⭐ 4.2 avg rating
Delhi $22,800.00 ⭐ 4.0 avg rating
Bengaluru $21,200.00 ⭐ 4.5 avg rating ← Highest rated!
Chennai $18,900.00 ⭐ 3.8 avg rating
Pune $15,820.00 ⭐ 4.1 avg rating
[5/8] Top 3 Products by Revenue:
1. Laptop — $38,400.00
2. Phone — $22,400.00
3. Monitor — $10,200.00
[6/8] Categories:
Electronics Revenue: $72,100.00 Margin: 34.2%
Accessories Revenue: $31,870.00 Margin: 58.5%
[7/8] Best-Selling Product by City:
Mumbai: Laptop (25 units)
Delhi: Phone (20 units)
Bengaluru: Laptop (18 units)
[8/8] Exported:
✓ cleaned_sales.csv
✓ monthly_summary.csv
✓ city_analysis.csv
═══ PIPELINE COMPLETE ═══
💡 Tips for Success
- Always check
df.info()anddf.describe()first — they tell you about data types, missing values, and basic statistics instantly. - Use
axis=0for rows,axis=1for columns — common confusion when using.drop()and.apply(). - Vectorised operations are fast — avoid looping over DataFrames. Use
.apply()or built-in methods instead. locvsiloc:df.loc["label"]uses index labels;df.iloc[0]uses integer position.- Chain methods for readability:
df.dropna().groupby("City")["Revenue"].sum().sort_values(ascending=False). - Install pandas first:
pip install pandas.