Phase 8 📊 — Data Science

Topics: pandas Series, DataFrames, CSV handling, basic data analysis

Transform raw data into insights. pandas is the most popular Python library for data analysis — you'll use it to load, clean, explore, and present data.

🔄 Exercise Flow

📚 Prerequisites

Before starting these exercises, make sure you've read:

Introduction to pandas — What pandas is and how to install it
pandas Series — 1D labelled arrays
DataFrame Basics — Creating and inspecting DataFrames
Indexing & Selection — loc, iloc, and boolean filtering
CSV with pandas — Reading and writing CSV files

🔰 Starter: Series Operations Lab

Time: 12 minutes

Work with pandas Series to perform element-wise operations, filtering, and statistical calculations.

Learning Objectives

Create a Series from a list and a dictionary
Perform vectorised arithmetic operations
Use boolean indexing to filter values
Compute descriptive statistics (.mean(), .std(), .min(), .max())
Handle missing values with .isna() and .fillna()

Starter Code

# Starter: Series Operations Lab
import pandas as pd

# Daily temperatures (°C) for a week
temps = pd.Series([28.5, 30.1, 29.8, None, 31.2, 27.9, 26.5],
                  index=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])

# TODO 1: Print the Series
# TODO 2: How many days have data? (count non-null values)
# TODO 3: What is the average temperature for the week?
# TODO 4: What is the highest and lowest temperature?
# TODO 5: Fill the missing value with the average temperature
# TODO 6: Convert all temps to Fahrenheit using vectorised operations (F = C * 9/5 + 32)
# TODO 7: Find all days where temp > 30°C (after filling missing)
# TODO 8: Add 5 bonus questions of your own using Series methods

# Your code here 👇

Expected Output

Mon    28.5
Tue    30.1
Wed    29.8
Thu     NaN
Fri    31.2
Sat    27.9
Sun    26.5
dtype: float64

=== ANALYSIS ===
Non-null count: 6 / 7
Average temp:  29.0°C
Max temp:      31.2°C (Fri)
Min temp:      26.5°C (Sun)
Filled NaN:    29.0°C

Days > 30°C:
  Tue    30.1
  Fri    31.2

Temps in Fahrenheit:
  Mon    83.30
  Tue    86.18
  Wed    85.64
  ...

⭐ Medium: Sales Data Analysis with DataFrames

Time: 22 minutes

Load a sales dataset into a DataFrame, perform cleaning, grouping, aggregation, and visualisation-style analysis (print-based).

Learning Objectives

Create a DataFrame from a dictionary
Use .head(), .info(), .describe() for exploration
Filter rows with boolean conditions
Group data with .groupby() and aggregate with .agg()
Add new calculated columns
Sort and rank data

Starter Code

# Medium: Sales Data Analysis with DataFrames
import pandas as pd

# Sample sales data
sales_data = {
    "Date": [
        "2025-01-05", "2025-01-05", "2025-01-06", "2025-01-06",
        "2025-01-07", "2025-01-07", "2025-01-08", "2025-01-08"
    ],
    "Product": [
        "Laptop", "Mouse", "Laptop", "Keyboard",
        "Monitor", "Mouse", "Laptop", "Keyboard"
    ],
    "Category": [
        "Electronics", "Accessories", "Electronics", "Accessories",
        "Electronics", "Accessories", "Electronics", "Accessories"
    ],
    "Units": [3, 15, 2, 10, 5, 20, 4, 8],
    "Unit_Price": [1200, 25, 1200, 80, 300, 25, 1200, 80],
}

df = pd.DataFrame(sales_data)

# TODO 1: Add a 'Revenue' column (Units * Unit_Price)
# TODO 2: Show basic info and summary statistics
# TODO 3: Filter to show only Electronics with revenue > $1000
# TODO 4: Calculate total revenue per product
# TODO 5: Calculate total revenue per category
# TODO 6: Find the best-selling product (most units sold)
# TODO 7: Add a 'Profit_Margin' column: Electronics = 30%, Accessories = 50%
# TODO 8: Sort products by total revenue descending
# TODO 9: Create a summary report

# Your code here 👇

Expected Output

=== SALES DATA OVERVIEW ===
Shape: (8, 5)
Columns: Date, Product, Category, Units, Unit_Price

=== SUMMARY STATISTICS ===
           Units    Unit_Price     Revenue
count   8.00        8.00          8.00
mean    8.38      415.00       4950.00
std     6.46      497.71       5575.35
min     2.00       25.00        375.00
max    20.00     1200.00      15000.00

=== REVENUE BY PRODUCT ===
Laptop      $10,800.00
Mouse       $875.00
Keyboard    $1,440.00
Monitor     $1,500.00

=== REVENUE BY CATEGORY ===
Electronics   $14,100.00
Accessories   $2,315.00

=== BEST SELLERS ===
Most Units:   Mouse (35 units)
Top Revenue:  Laptop ($10,800)

🏆 Hard: Complete Data Analysis Pipeline

Time: 35 minutes

Build a complete data analysis pipeline that generates sample data, loads it into pandas, cleans it, performs multi-dimensional analysis, and exports results.

Learning Objectives

Generate realistic sample data using pandas and random data
Perform data cleaning (handle missing values, remove duplicates, fix types)
Use multi-level grouping and aggregation
Create derived metrics and calculated columns
Implement conditional logic with np.where() or .apply()
Export analysis results to CSV
Build a reusable analysis pipeline function

Starter Code

# Hard: Complete Data Analysis Pipeline
import pandas as pd
import numpy as np
import random
from datetime import datetime, timedelta

# --- Generate realistic sales data ---
def generate_sales_data(n=200):
    """Generate n rows of simulated sales data."""
    products = {
        "Laptop": {"category": "Electronics", "price": 1200, "cost": 800},
        "Phone": {"category": "Electronics", "price": 800, "cost": 550},
        "Tablet": {"category": "Electronics", "price": 400, "cost": 280},
        "Mouse": {"category": "Accessories", "price": 25, "cost": 10},
        "Keyboard": {"category": "Accessories", "price": 80, "cost": 35},
        "Monitor": {"category": "Electronics", "price": 300, "cost": 200},
        "Headphones": {"category": "Accessories", "price": 150, "cost": 70},
        "USB Hub": {"category": "Accessories", "price": 35, "cost": 15},
    }
    cities = ["Mumbai", "Delhi", "Bengaluru", "Chennai", "Pune"]
    records = []
    base_date = datetime(2025, 1, 1)

    for _ in range(n):
        product_name = random.choice(list(products.keys()))
        product_info = products[product_name]
        date = base_date + timedelta(days=random.randint(0, 90))
        units = random.randint(1, 10)
        records.append({
            "Date": date.strftime("%Y-%m-%d"),
            "Product": product_name,
            "Category": product_info["category"],
            "Price": product_info["price"],
            "Cost": product_info["cost"],
            "Units": units,
            "City": random.choice(cities),
            "Rating": round(random.uniform(1, 5), 1),
        })

    df = pd.DataFrame(records)
    # Introduce some missing values and duplicates
    df.loc[random.sample(range(n), 5), "Rating"] = None
    df = pd.concat([df, df.iloc[:3]])  # Add 3 duplicates
    return df

# --- Load data ---
df = generate_sales_data(200)
print(f"Raw data: {df.shape[0]} rows, {df.shape[1]} columns")

# TODO 1: Data Cleaning
#   - Remove duplicate rows (keep first)
#   - Fill missing ratings with the average rating
#   - Add a 'Revenue' column (Units * Price)
#   - Add a 'Profit' column (Revenue - Units * Cost)
#   - Add a 'Profit_Margin_%' column

# TODO 2: Monthly Analysis
#   - Extract month from Date into a new column
#   - Calculate total revenue and profit per month

# TODO 3: City Analysis
#   - Which city generates the most revenue?
#   - Which city has the highest average rating?

# TODO 4: Product Performance
#   - Top 3 products by revenue
#   - Bottom 3 products by profit margin

# TODO 5: Category Summary
#   - Total revenue, profit, and avg margin per category

# TODO 6: Advanced Analysis
#   - Find products where profit margin > 50% AND units sold > 50
#   - Calculate the correlation between Rating and Revenue
#   - Find the best-selling product in each city

# TODO 7: Export
#   - Save the cleaned dataset to "cleaned_sales.csv"
#   - Save the monthly summary to "monthly_summary.csv"
#   - Save the city analysis to "city_analysis.csv"

# TODO 8: Wrap everything in a function: run_analysis(n_records=200)

# Your code here 👇

Expected Output

═══ DATA ANALYSIS PIPELINE ═══

[1/8] Generating data... 200 rows generated.

[2/8] Cleaning...
  Duplicates removed: 3
  Missing ratings filled: 5
  Shape: 197 × 11

[3/8] Monthly Revenue & Profit:
  Month      Revenue      Profit     Margin
  Jan       $28,450.00   $10,230.00  35.9%
  Feb       $32,100.00   $11,890.00  37.0%
  Mar       $35,670.00   $13,450.00  37.7%

[4/8] City Performance:
  Mumbai     $24,500.00  ⭐ 4.2 avg rating
  Delhi      $22,800.00  ⭐ 4.0 avg rating
  Bengaluru  $21,200.00  ⭐ 4.5 avg rating  ← Highest rated!
  Chennai    $18,900.00  ⭐ 3.8 avg rating
  Pune       $15,820.00  ⭐ 4.1 avg rating

[5/8] Top 3 Products by Revenue:
  1. Laptop     — $38,400.00
  2. Phone      — $22,400.00
  3. Monitor    — $10,200.00

[6/8] Categories:
  Electronics   Revenue: $72,100.00  Margin: 34.2%
  Accessories   Revenue: $31,870.00  Margin: 58.5%

[7/8] Best-Selling Product by City:
  Mumbai:     Laptop (25 units)
  Delhi:      Phone (20 units)
  Bengaluru:  Laptop (18 units)

[8/8] Exported:
  ✓ cleaned_sales.csv
  ✓ monthly_summary.csv
  ✓ city_analysis.csv

═══ PIPELINE COMPLETE ═══

💡 Tips for Success

Always check df.info() and df.describe() first — they tell you about data types, missing values, and basic statistics instantly.
Use axis=0 for rows, axis=1 for columns — common confusion when using .drop() and .apply().
Vectorised operations are fast — avoid looping over DataFrames. Use .apply() or built-in methods instead.
loc vs iloc: df.loc["label"] uses index labels; df.iloc[0] uses integer position.
Chain methods for readability: df.dropna().groupby("City")["Revenue"].sum().sort_values(ascending=False).
Install pandas first: pip install pandas.

← Back to Exercises

Phase 8 📊 — Data Science

🔄 Exercise Flow

📚 Prerequisites

🔰 Starter: Series Operations Lab

Learning Objectives

Starter Code

Expected Output

⭐ Medium: Sales Data Analysis with DataFrames

Learning Objectives

Starter Code

Expected Output

🏆 Hard: Complete Data Analysis Pipeline

Learning Objectives

Starter Code

Expected Output

💡 Tips for Success

📍 Visit Us

🏫 VD Computer Tuition Surat

Computer Classes & Tuition — Areas We Serve in Surat

🔄 Exercise Flow​

📚 Prerequisites​

🔰 Starter: Series Operations Lab​

Learning Objectives​

Starter Code​

Expected Output​

⭐ Medium: Sales Data Analysis with DataFrames​

Learning Objectives​

Starter Code​

Expected Output​

🏆 Hard: Complete Data Analysis Pipeline​

Learning Objectives​

Starter Code​

Expected Output​

💡 Tips for Success​

📍 Visit Us

🏫 VD Computer Tuition Surat

Computer Classes & Tuition — Areas We Serve in Surat

🔄 Exercise Flow

📚 Prerequisites

🔰 Starter: Series Operations Lab

Learning Objectives

Starter Code

Expected Output

⭐ Medium: Sales Data Analysis with DataFrames

Learning Objectives

Starter Code

Expected Output

🏆 Hard: Complete Data Analysis Pipeline

Learning Objectives

Starter Code

Expected Output

💡 Tips for Success