# Reading & Writing CSV Files with pandas
Prerequisites: Indexing and Selection, Python CSV Basics
**Mentor's Note:** Real-world data almost always arrives as a CSV file. `pd.read_csv()` will be the single most-used pandas function in your data science career. Learn it well; CBSE practicals test it every year.
## What You'll Learn

By the end of this tutorial, you'll know:

- How to load any CSV file into a DataFrame in one line with `pd.read_csv()`
- Which parameters matter most: `usecols`, `index_col`, `nrows`, `encoding`
- How to detect and handle missing values with `isnull()`, `dropna()`, and `fillna()`
- Why you must always use `index=False` when saving with `to_csv()`, and what goes wrong if you forget
## The Scenario: The School Data Import

Your school exports its student database as a `.csv` file every semester: 5,000 students, 10 columns, with some missing values and extra whitespace. Opening it in Excel takes 30 seconds and crashes on older machines. With pandas:
```python
import pandas as pd

df = pd.read_csv('students.csv')
print(df.shape)   # (5000, 10), instantly
print(df.head())  # First 5 rows
```
That's it. The DataFrame is ready to filter, sort, clean, and export.
**See Also: Built-in CSV Module**

This page covers pandas CSV handling for data analysis. If you need the built-in Python `csv` module (no third-party library required), see Reading & Writing CSV Files (built-in `csv` module).
## Concept Explanation

### 1. `pd.read_csv()`: Key Parameters

| Parameter | Purpose | Example |
|---|---|---|
| `filepath_or_buffer` | Path to CSV file | `'students.csv'` |
| `sep` | Delimiter character | `sep=','` (default) or `sep='\t'` for TSV |
| `header` | Row number(s) to use for column names | `header=0` (default) |
| `index_col` | Column to use as row index | `index_col='name'` |
| `usecols` | Load only specific columns | `usecols=['name', 'marks']` |
| `nrows` | Load only the first N rows | `nrows=100` |
| `encoding` | File encoding | `encoding='utf-8'` |
| `na_values` | Extra values to treat as missing | `na_values=['N/A', '-']` |
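These parameters compose naturally. Here's a quick sketch combining `sep`, `na_values`, and `usecols`; the student data is made up, and `io.StringIO` stands in for a file path so the example runs without a file on disk (a real path works identically):

```python
import io
import pandas as pd

# Tab-separated data in which 'N/A' and '-' mark missing values
tsv_data = "name\tmarks\tcity\nVishnu\t95\tSurat\nAnkit\tN/A\tMumbai\nPriya\t70\t-"

df = pd.read_csv(
    io.StringIO(tsv_data),      # any file path works the same way
    sep='\t',                   # tab delimiter instead of the default comma
    na_values=['N/A', '-'],     # treat these strings as NaN
    usecols=['name', 'marks'],  # load only two of the three columns
)
print(df)                # 3 rows, 2 columns; Ankit's marks is NaN
print(df.isnull().sum())
```

Note that `usecols` means the `city` column is never loaded at all, and `na_values` converts the `'N/A'` marker into a proper `NaN` during parsing.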
### 2. `df.to_csv()`: Key Parameters

| Parameter | Purpose | Example |
|---|---|---|
| `path_or_buf` | Output file path | `'output.csv'` |
| `sep` | Delimiter | `sep=','` (default) |
| `index` | Write row index to file | Always use `index=False` |
| `encoding` | Output encoding | `encoding='utf-8'` |
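To see why `index=False` matters, here's a small round-trip sketch using an in-memory buffer instead of a real file (the column names are illustrative):

```python
import io
import pandas as pd

df = pd.DataFrame({'name': ['Vishnu', 'Ankit'], 'marks': [95, 82]})

# Round-trip WITHOUT index=False: the index sneaks in as 'Unnamed: 0'
buf = io.StringIO()
df.to_csv(buf)                  # index is written by default
buf.seek(0)
bad = pd.read_csv(buf)
print(bad.columns.tolist())     # ['Unnamed: 0', 'name', 'marks']

# Round-trip WITH index=False: clean columns
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
good = pd.read_csv(buf)
print(good.columns.tolist())    # ['name', 'marks']
```

The junk `Unnamed: 0` column appears because the default index is written with an empty header, which `read_csv` then has to name on re-read.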
### 3. Handling Missing Data After Reading

| Method | Purpose |
|---|---|
| `df.isnull().sum()` | Count missing values per column |
| `df.dropna()` | Remove rows with any missing value |
| `df.fillna(value)` | Replace missing values with `value` |
## Visual Logic

```mermaid
graph LR
    A["students.csv"] --> B["pd.read_csv()"]
    B --> C["DataFrame"]
    C --> D["Clean / Analyse"]
    D --> E["df.to_csv()"]
    E --> F["cleaned_students.csv"]
```
## Implementation

```python
import pandas as pd

# Load only specific columns -- faster on large files
df = pd.read_csv(
    'students.csv',
    usecols=['name', 'marks', 'city'],
    nrows=100,          # Load only the first 100 rows
    encoding='utf-8'    # Specify encoding
)

# Use a column as the row index
df2 = pd.read_csv('students.csv', index_col='roll_no')
print(df2.loc[101])  # Access row by roll number
# Output:
# name     Vishnu
# marks        95
# city      Surat
# Name: 101, dtype: object
```
```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Vishnu', 'Ankit', 'Priya'],
    'marks': [95, 82, 70],
    'grade': ['A', 'B', 'C']
})

# Save to CSV -- always use index=False!
df.to_csv('results.csv', index=False)

# Without index=False, pandas writes an extra unnamed column:
#   ,name,marks,grade    <- unwanted!
# With index=False:
#   name,marks,grade     <- clean!
```
```python
import pandas as pd

df = pd.read_csv('students.csv')

# Check how many values are missing per column
print(df.isnull().sum())
# Output:
# name     0
# marks    3    <- 3 students have no marks recorded
# city     7
# dtype: int64

# Option 1: Drop rows with ANY missing value
df_clean = df.dropna()

# Option 2: Fill missing marks with 0 and missing cities with 'Unknown'
df_filled = df.fillna({'marks': 0, 'city': 'Unknown'})

# Save the cleaned version
df_clean.to_csv('students_cleaned.csv', index=False)
```
Build a small in-memory CSV and round-trip it; no file needed on disk.

```python
>>> import pandas as pd
>>> import io
>>> csv_data = """name,marks,city
... Vishnu,95,Surat
... Ankit,,Mumbai
... Priya,70,
... Sara,88,Delhi"""
>>> df = pd.read_csv(io.StringIO(csv_data))
>>> df
     name  marks    city
0  Vishnu   95.0   Surat
1   Ankit    NaN  Mumbai
2   Priya   70.0     NaN
3    Sara   88.0   Delhi
>>> df.isnull().sum()
name     0
marks    1
city     1
dtype: int64
>>> df.fillna({'marks': 0, 'city': 'Unknown'})
     name  marks     city
0  Vishnu   95.0    Surat
1   Ankit    0.0   Mumbai
2   Priya   70.0  Unknown
3    Sara   88.0    Delhi
>>> df.dropna()
     name  marks   city
0  Vishnu   95.0  Surat
3    Sara   88.0  Delhi
```
**New to the REPL?** Type `python3` in your terminal. `io.StringIO` lets you simulate a CSV file from a string, perfect for experimenting without needing a real file on disk. Type `exit()` to quit.
## Comparison: Built-in `csv` Module vs `pd.read_csv`

| Feature | `csv.reader` | `pd.read_csv` |
|---|---|---|
| Returns | List of rows | DataFrame |
| Column access | By position: `row[0]` | By name: `df['name']` |
| Missing data handling | Manual | Built-in |
| Filter rows | Manual loop | `df[condition]` |
| Statistical analysis | Manual | `df.describe()` |
| Large files | Slow | Optimised (C backend) |
| Requires install | No (built-in) | Yes (`pip install pandas`) |
**When to use which?**

- Use the `csv` module for simple scripts, small files, or when you cannot install third-party packages.
- Use pandas for data analysis, cleaning, filtering, or any file with more than a few hundred rows.
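To make the contrast concrete, here's the same small task done both ways: computing the average of a marks column, sketched with in-memory data so no file is needed:

```python
import csv
import io

import pandas as pd

data = "name,marks\nVishnu,95\nAnkit,82\nPriya,70"

# Built-in csv module: manual positional access and type conversion
reader = csv.reader(io.StringIO(data))
next(reader)                             # skip the header row
marks = [int(row[1]) for row in reader]  # column by position, cast by hand
print(sum(marks) / len(marks))           # about 82.33

# pandas: column access by name, aggregation built in
df = pd.read_csv(io.StringIO(data))
print(df['marks'].mean())                # about 82.33
```

With three rows the difference is cosmetic; with thousands of rows and a dozen columns, the manual bookkeeping on the `csv.reader` side grows quickly while the pandas side stays one line per operation.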
## Sample Dry Run: What Happens When You Call `pd.read_csv()`?

File `students.csv` (3 data rows plus a header row):

| Step | What pandas does | Result |
|---|---|---|
| 1 | Reads header row | Column names: `name`, `marks`, `city` |
| 2 | Reads row 1 | Vishnu, 95, Surat (all values present) |
| 3 | Reads row 2 | Ankit, (empty), Mumbai; `marks` becomes `NaN` |
| 4 | Reads row 3 | Priya, 70, (empty); `city` becomes `NaN` |
| 5 | Assigns default index | `RangeIndex(0, 1, 2)` |
| 6 | Returns DataFrame | shape `(3, 3)`; `marks` dtype `float64` (because of the `NaN`) |

**Notice:** a single missing value forces the `marks` column from `int64` to `float64`, because `NaN` is a float in Python. Always run `df.info()` after reading to check this.
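You can watch this dtype promotion happen directly. The sketch below also shows one possible fix, pandas' nullable `'Int64'` dtype (note the capital I), which stores whole numbers while still allowing missing values; the conversion from a float column is supported in recent pandas versions:

```python
import io
import pandas as pd

csv_data = "name,marks\nVishnu,95\nAnkit,\nPriya,70"
df = pd.read_csv(io.StringIO(csv_data))

# The empty cell becomes NaN; NaN is a float, so the whole
# column is promoted from int64 to float64
print(df['marks'].dtype)       # float64

# The nullable integer dtype keeps whole numbers and shows
# missing entries as <NA> instead of NaN
df['marks'] = df['marks'].astype('Int64')
print(df['marks'].dtype)       # Int64
print(df)
```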
## Practice Lab

**Task: Real CSV Analysis**

Download any free CSV dataset (e.g., from data.gov.in or a sample students CSV). Then:

- Load it with `pd.read_csv()` and print its `shape`.
- Use `df.isnull().sum()` to check for missing values.
- Drop rows with missing values using `df.dropna()`.
- Filter rows based on a numeric condition (e.g., marks > 50).
- Save the cleaned DataFrame with `df.to_csv('cleaned.csv', index=False)`.

**Bonus:** Try loading only 2 specific columns using `usecols`.
## Best Practices & Common Mistakes

### Best Practices

- Always call `df.info()` right after `read_csv()`: check dtypes, missing counts, and whether pandas parsed the columns correctly before doing any analysis.
- Always use `index=False` in `to_csv()`: forgetting it adds an unnamed index column that pollutes the file when it is re-read.
- Use `usecols` on large files: loading only the columns you need can cut memory usage dramatically on wide datasets.
### Common Mistakes

- Forgetting `index=False`: the saved CSV gets an extra unnamed first column (`Unnamed: 0`). Always check: `df.to_csv('out.csv', index=False)`.
- Ignoring encoding errors: files from Windows often use `cp1252` encoding. If you get a `UnicodeDecodeError`, try `encoding='cp1252'` or `encoding='latin-1'`.
- Not using `.copy()` after filtering: `df_clean = df.dropna()` followed by `df_clean['new_col'] = ...` can trigger `SettingWithCopyWarning`. Use `df_clean = df.dropna().copy()`.
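The third mistake is easy to reproduce. Here's a minimal sketch of the safe pattern (the column names and data are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Vishnu', 'Ankit', 'Priya'],
                   'marks': [95.0, None, 70.0]})

# Risky: assigning a new column to the raw result of dropna()
# may raise SettingWithCopyWarning:
#   df_clean = df.dropna()
#   df_clean['grade'] = ...   # may warn

# Safe: take an explicit copy before adding columns
df_clean = df.dropna().copy()
df_clean['grade'] = ['A', 'B']   # two rows survive dropna()
print(df_clean)
```

The `.copy()` makes it unambiguous to pandas that `df_clean` is an independent DataFrame, so writing to it can never be mistaken for a write-through to `df`.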
## Frequently Asked Questions

**Q: Which pandas function reads a CSV file into a DataFrame?**

`pd.read_csv('filename.csv')`. This is the standard answer for CBSE exams. The return value is a pandas DataFrame with the file's header row as column names and a default `RangeIndex` as the row index.

**Q: Why should you always use `index=False` when saving with `to_csv()`?**

Without it, pandas writes the row index (0, 1, 2, ...) as an extra unnamed column in the CSV. When you re-read that file, you get a column named `Unnamed: 0` full of row numbers: junk data. The fix: `df.to_csv('out.csv', index=False)`.

**Q: What does `df.isnull().sum()` return?**

A Series showing the count of missing (`NaN`) values in each column. For example, if `marks` has 3 missing values, `df.isnull().sum()['marks']` returns 3. Compare this to the total row count (`len(df)`) to understand how much data is missing.

**Q: What is the difference between `dropna()` and `fillna()`?**

`dropna()` removes rows (or columns) that contain any missing value, so you lose data. `fillna(value)` replaces missing values with a specified value, so you keep all rows. Use `dropna()` when missing rows are invalid; use `fillna()` when you can substitute a sensible default (e.g., 0 for marks, `'Unknown'` for city).

**Q: How do you load only specific columns from a large CSV?**

Use the `usecols` parameter: `pd.read_csv('file.csv', usecols=['col1', 'col2'])`. pandas reads only those columns into memory, which is much faster and cheaper for wide datasets with 50+ columns.
## Summary

In this tutorial, you've learned:

- `pd.read_csv('file.csv')` loads any CSV into a DataFrame in one line
- `usecols`, `nrows`, and `index_col` control what gets loaded
- `df.isnull().sum()` reveals missing values; `dropna()` removes them; `fillna()` replaces them
- Always `df.to_csv('out.csv', index=False)`; never forget `index=False`
- Run `df.info()` immediately after reading; missing values change column dtypes from `int64` to `float64`
## Interview & Exam Tips

**Q: Which pandas function reads a CSV file into a DataFrame?**
`pd.read_csv('filename.csv')`. Standard CBSE answer. Returns a DataFrame with the header row as column names.

**Q: Why should you always use `index=False` when saving with `to_csv()`?**
Without it, pandas writes the row index as an extra unnamed column, creating junk data when the file is re-read.

**Q: What does `df.isnull().sum()` return?**
A Series showing the count of missing (`NaN`) values in each column.

**Q: What is the difference between `dropna()` and `fillna()`?**
`dropna()` removes rows with missing values; `fillna()` replaces missing values with a specified value.
## Further Reading

Continue your learning path:

- Indexing & Selection: apply `loc`, `iloc`, and filtering on real CSV data
- Data Science Roadmap: overview of the full pandas learning path

Go deeper:

- Official pandas docs, IO tools: the full reference for all read/write functions
- DataFrame Basics: revisit `df.info()` and `df.describe()` to understand your loaded data
- Indexing & Selection: filter and slice your CSV data after loading