Skip to content

Reading & Writing CSV Files with pandas πŸ“‚ΒΆ

Python Professional PathData Science with pandas

Prerequisites: Indexing and Selection, Python CSV Basics

Mentor's Note: Real-world data almost always arrives as a CSV file. pd.read_csv() will be the single most-used pandas function in your data science career. Learn it well β€” CBSE practicals test it every year. πŸ’‘

What You'll Learn

By the end of this tutorial, you'll know:

  • How to load any CSV file into a DataFrame in one line with pd.read_csv()
  • Which parameters matter most: usecols, index_col, nrows, encoding
  • How to detect and handle missing values with isnull(), dropna(), and fillna()
  • Why you must always use index=False when saving with to_csv() β€” and what goes wrong if you forget

🌟 The Scenario: The School Data Import¢

Your school exports its student database as a .csv file every semester β€” 5,000 students, 10 columns, with some missing values and extra whitespace. Opening it in Excel takes 30 seconds and crashes on older machines. With pandas:

import pandas as pd
df = pd.read_csv('students.csv')
print(df.shape)        # (5000, 10) β€” instantly
print(df.head())       # First 5 rows

That's it. The DataFrame is ready to filter, sort, clean, and export.


See Also: Built-in CSV Module

This page covers pandas CSV handling for data analysis. If you need the built-in Python csv module (no third-party library required), see Reading & Writing CSV Files β€” built-in csv module.


πŸ“– Concept ExplanationΒΆ

1. pd.read_csv() β€” Key ParametersΒΆ

Parameter Purpose Example
filepath_or_buffer Path to CSV file 'students.csv'
sep Delimiter character sep=',' (default) or sep='\t' for TSV
header Row number(s) for column names header=0 (default)
index_col Column to use as row index index_col='name'
usecols Load only specific columns usecols=['name', 'marks']
nrows Load only first N rows nrows=100
encoding File encoding encoding='utf-8'
na_values Values to treat as missing na_values=['N/A', '-']

2. df.to_csv() β€” Key ParametersΒΆ

Parameter Purpose Example
path_or_buf Output file path 'output.csv'
sep Delimiter sep=',' (default)
index Write row index to file Always use index=False
encoding Output encoding encoding='utf-8'

3. Handling Missing Data After ReadingΒΆ

Method Purpose
df.isnull().sum() Count missing values per column
df.dropna() Remove rows with any missing value
df.fillna(value) Replace missing values with value

🎨 Visual Logic¢

graph LR
    A["students.csv"] --> B["pd.read_csv()"]
    B --> C["DataFrame"]
    C --> D["Clean / Analyse"]
    D --> E["df.to_csv()"]
    E --> F["cleaned_students.csv"]

πŸ’» ImplementationΒΆ

import pandas as pd

# Read entire CSV into a DataFrame
df = pd.read_csv('students.csv')

print(df.shape)     # e.g., (5000, 10)
print(df.head())    # First 5 rows
print(df.tail(3))   # Last 3 rows
print(df.columns)   # All column names
# Output:
# Index(['name', 'marks', 'city', 'grade', ...], dtype='object')
import pandas as pd

# Load only specific columns β€” faster on large files
df = pd.read_csv(
    'students.csv',
    usecols=['name', 'marks', 'city'],
    nrows=100,           # Load only first 100 rows
    encoding='utf-8'     # Specify encoding
)

# Use a column as the row index
df2 = pd.read_csv('students.csv', index_col='roll_no')
print(df2.loc[101])     # Access row by roll number
# Output:
# name      Vishnu
# marks         95
# city       Surat
# Name: 101, dtype: object
import pandas as pd

df = pd.DataFrame({
    'name':   ['Vishnu', 'Ankit', 'Priya'],
    'marks':  [95, 82, 70],
    'grade':  ['A', 'B', 'C']
})

# Save to CSV β€” always use index=False!
df.to_csv('results.csv', index=False)

# Without index=False, pandas writes an extra unnamed column:
# ,name,marks,grade   ← unwanted!
# With index=False:
# name,marks,grade    ← clean!
import pandas as pd

df = pd.read_csv('students.csv')

# Check how many values are missing per column
print(df.isnull().sum())
# Output:
# name       0
# marks      3    ← 3 students have no marks recorded
# city       7
# dtype: int64

# Option 1: Drop rows with ANY missing value
df_clean = df.dropna()

# Option 2: Fill missing marks with 0
df_filled = df.fillna({'marks': 0, 'city': 'Unknown'})

# Save cleaned version
df_clean.to_csv('students_cleaned.csv', index=False)

Build a small in-memory CSV and round-trip it β€” no file needed on disk.

>>> import pandas as pd
>>> import io
>>> csv_data = """name,marks,city
... Vishnu,95,Surat
... Ankit,,Mumbai
... Priya,70,
... Sara,88,Delhi"""
>>> df = pd.read_csv(io.StringIO(csv_data))
>>> df
     name  marks    city
0  Vishnu   95.0   Surat
1   Ankit    NaN  Mumbai
2   Priya   70.0    None
3    Sara   88.0   Delhi
>>> df.isnull().sum()
name     0
marks    1
city     1
dtype: int64
>>> df.fillna({'marks': 0, 'city': 'Unknown'})
     name  marks     city
0  Vishnu   95.0    Surat
1   Ankit    0.0   Mumbai
2   Priya   70.0  Unknown
3    Sara   88.0    Delhi
>>> df.dropna()
     name  marks   city
0  Vishnu   95.0  Surat
3    Sara   88.0  Delhi

New to the REPL?

Type python3 in your terminal. io.StringIO lets you simulate a CSV file from a string β€” perfect for experimenting without needing a real file on disk. Type exit() to quit.


πŸ“Š Comparison: Built-in csv module vs pd.read_csvΒΆ

Feature csv.reader pd.read_csv
Returns List of rows DataFrame
Column access By position row[0] By name df['name']
Missing data handling Manual Built-in
Filter rows Manual loop df[condition]
Statistical analysis Manual df.describe()
Large files Slow Optimised (C backend)
Requires install ❌ (built-in) βœ… pip install pandas

When to use which?

  • Use csv module for simple scripts, small files, or when you cannot install third-party packages.
  • Use pandas for data analysis, cleaning, filtering, or any file with more than a few hundred rows.

πŸ“Š Sample Dry Run: What Happens When You Call pd.read_csv()?ΒΆ

File students.csv (3 rows, header row):

name,marks,city
Vishnu,95,Surat
Ankit,,Mumbai
Priya,70,
Step What pandas does Result
1 Reads header row Column names: name, marks, city
2 Reads row 1 Vishnu, 95, Surat β€” all values present
3 Reads row 2 Ankit, (empty), Mumbai β€” marks becomes NaN
4 Reads row 3 Priya, 70, (empty) β€” city becomes NaN
5 Assigns default index RangeIndex(0, 1, 2)
6 Returns DataFrame shape (3, 3), marks dtype float64 (because of NaN)

Notice: a missing value forces the marks column from int64 to float64 β€” because NaN is a float in Python. Always run df.info() after reading to check this.


🎯 Practice Lab πŸ§ͺΒΆ

Task: Real CSV Analysis

Download any free CSV dataset (e.g., from data.gov.in or a sample students CSV).

Then:

  1. Load it with pd.read_csv() and print its shape.
  2. Use df.isnull().sum() to check for missing values.
  3. Drop rows with missing values using df.dropna().
  4. Filter rows based on a numeric condition (e.g., marks > 50).
  5. Save the cleaned DataFrame with df.to_csv('cleaned.csv', index=False).

Bonus: Try loading only 2 specific columns using usecols.


πŸ“š Best Practices & Common MistakesΒΆ

βœ… Best PracticesΒΆ

  • Always call df.info() right after read_csv() β€” check dtypes, missing counts, and whether pandas guessed columns correctly before doing any analysis
  • Always use index=False in to_csv() β€” forgetting adds an unnamed index column that corrupts the file when re-read
  • Use usecols on large files β€” loading only the columns you need can reduce memory usage by 80% on wide datasets

❌ Common Mistakes¢

  • Forgetting index=False β€” the saved CSV gets an extra unnamed first column (Unnamed: 0). Always check: df.to_csv('out.csv', index=False)
  • Ignoring encoding errors β€” files from Windows often use cp1252 encoding. If you get a UnicodeDecodeError, try encoding='cp1252' or encoding='latin-1'
  • Not using .copy() after filtering β€” df_clean = df.dropna() then df_clean['new_col'] = ... triggers SettingWithCopyWarning. Use df_clean = df.dropna().copy()

❓ Frequently Asked QuestionsΒΆ

Q: Which pandas function reads a CSV file into a DataFrame?

pd.read_csv('filename.csv') β€” this is the standard answer for CBSE exams. The return value is a pandas DataFrame with the file's header row as column names and a default RangeIndex as the row index.

Q: Why should you always use index=False when saving with to_csv()?

Without it, pandas writes the row index (0, 1, 2...) as an extra unnamed column in the CSV. When you re-read that file, you get a column named Unnamed: 0 full of row numbers β€” junk data. The fix: df.to_csv('out.csv', index=False).

Q: What does df.isnull().sum() return?

A Series showing the count of missing (NaN) values in each column. For example, if marks has 3 missing values, df.isnull().sum()['marks'] returns 3. Compare this to the total row count (len(df)) to understand how much data is missing.

Q: What is the difference between dropna() and fillna()?

dropna() removes rows (or columns) that contain any missing value β€” you lose data. fillna(value) replaces missing values with a specified value β€” you keep all rows. Use dropna() when missing rows are invalid; use fillna() when you can substitute a sensible default (e.g., 0 for marks, 'Unknown' for city).

Q: How do you load only specific columns from a large CSV?

Use the usecols parameter: pd.read_csv('file.csv', usecols=['col1', 'col2']). pandas reads only those columns into memory β€” much faster and cheaper for wide datasets with 50+ columns.


βœ… SummaryΒΆ

In this tutorial, you've learned:

  • βœ… pd.read_csv('file.csv') loads any CSV into a DataFrame in one line
  • βœ… Use usecols, nrows, and index_col to control what gets loaded
  • βœ… df.isnull().sum() reveals missing values; dropna() removes them; fillna() replaces them
  • βœ… Always df.to_csv('out.csv', index=False) β€” never forget index=False
  • βœ… Run df.info() immediately after reading β€” missing values change column dtypes from int64 to float64

πŸ’‘ Interview & Exam TipsΒΆ

Q: Which pandas function reads a CSV file into a DataFrame?

pd.read_csv('filename.csv') β€” standard CBSE answer. Returns a DataFrame with header row as column names.

Q: Why should you always use index=False when saving with to_csv()?

Without it, pandas writes the row index as an extra unnamed column, creating junk data when the file is re-read.

Q: What does df.isnull().sum() return?

A Series showing the count of missing (NaN) values in each column.

Q: What is the difference between dropna() and fillna()?

dropna() removes rows with missing values. fillna() replaces missing values with a specified value.


πŸ“š Further ReadingΒΆ

Continue your learning path:

Go deeper: