Skip to main content

Reading & Writing CSV Files with pandas ๐Ÿ“‚

Mentor's Note: Real-world data almost always arrives as a CSV file. pd.read_csv() will be the single most-used pandas function in your data science career. Learn it well โ€” CBSE practicals test it every year. ๐Ÿ’ก

What You'll Learn

By the end of this tutorial, you'll know:

  • How to load any CSV file into a DataFrame in one line with pd.read_csv()
  • Which parameters matter most: usecols, index_col, nrows, encoding
  • How to detect and handle missing values with isnull(), dropna(), and fillna()
  • Why you must always use index=False when saving with to_csv() โ€” and what goes wrong if you forget

๐ŸŒŸ The Scenario: The School Data Importโ€‹

Your school exports its student database as a .csv file every semester โ€” 5,000 students, 10 columns, with some missing values and extra whitespace. Opening it in Excel takes 30 seconds and crashes on older machines. With pandas:

import pandas as pd
df = pd.read_csv('students.csv')
print(df.shape) # (5000, 10) โ€” instantly
print(df.head()) # First 5 rows

That's it. The DataFrame is ready to filter, sort, clean, and export.


See Also: Built-in CSV Module

This page covers pandas CSV handling for data analysis. If you need the built-in Python csv module (no third-party library required), see Reading & Writing CSV Files โ€” built-in csv module.


๐Ÿ“– Concept Explanationโ€‹

1. pd.read_csv() โ€” Key Parametersโ€‹

ParameterPurposeExample
filepath_or_bufferPath to CSV file'students.csv'
sepDelimiter charactersep=',' (default) or sep='\t' for TSV
headerRow number(s) for column namesheader=0 (default)
index_colColumn to use as row indexindex_col='name'
usecolsLoad only specific columnsusecols=['name', 'marks']
nrowsLoad only first N rowsnrows=100
encodingFile encodingencoding='utf-8'
na_valuesValues to treat as missingna_values=['N/A', '-']

2. df.to_csv() โ€” Key Parametersโ€‹

ParameterPurposeExample
path_or_bufOutput file path'output.csv'
sepDelimitersep=',' (default)
indexWrite row index to fileAlways use index=False
encodingOutput encodingencoding='utf-8'

3. Handling Missing Data After Readingโ€‹

MethodPurpose
df.isnull().sum()Count missing values per column
df.dropna()Remove rows with any missing value
df.fillna(value)Replace missing values with value

๐ŸŽจ Visual Logicโ€‹


๐Ÿ’ป Implementationโ€‹

import pandas as pd

# Read entire CSV into a DataFrame
df = pd.read_csv('students.csv')

print(df.shape) # e.g., (5000, 10)
print(df.head()) # First 5 rows
print(df.tail(3)) # Last 3 rows
print(df.columns) # All column names
# Output:
# Index(['name', 'marks', 'city', 'grade', ...], dtype='object')

๐Ÿ“Š Comparison: Built-in csv module vs pd.read_csvโ€‹

Featurecsv.readerpd.read_csv
ReturnsList of rowsDataFrame
Column accessBy position row[0]By name df['name']
Missing data handlingManualBuilt-in
Filter rowsManual loopdf[condition]
Statistical analysisManualdf.describe()
Large filesSlowOptimised (C backend)
Requires installโŒ (built-in)โœ… pip install pandas

When to use which?

  • Use csv module for simple scripts, small files, or when you cannot install third-party packages.
  • Use pandas for data analysis, cleaning, filtering, or any file with more than a few hundred rows.

๐Ÿ“Š Sample Dry Run: What Happens When You Call pd.read_csv()?โ€‹

File students.csv (3 rows, header row):

name,marks,city
Vishnu,95,Surat
Ankit,,Mumbai
Priya,70,
StepWhat pandas doesResult
1Reads header rowColumn names: name, marks, city
2Reads row 1Vishnu, 95, Surat โ€” all values present
3Reads row 2Ankit, (empty), Mumbai โ€” marks becomes NaN
4Reads row 3Priya, 70, (empty) โ€” city becomes NaN
5Assigns default indexRangeIndex(0, 1, 2)
6Returns DataFrameshape (3, 3), marks dtype float64 (because of NaN)

Notice: a missing value forces the marks column from int64 to float64 โ€” because NaN is a float in Python. Always run df.info() after reading to check this.


๐ŸŽฏ Practice Lab ๐Ÿงชโ€‹

Task: Real CSV Analysis

Download any free CSV dataset (e.g., from data.gov.in or a sample students CSV).

Then:

  1. Load it with pd.read_csv() and print its shape.
  2. Use df.isnull().sum() to check for missing values.
  3. Drop rows with missing values using df.dropna().
  4. Filter rows based on a numeric condition (e.g., marks > 50).
  5. Save the cleaned DataFrame with df.to_csv('cleaned.csv', index=False).

Bonus: Try loading only 2 specific columns using usecols.


๐Ÿ“š Best Practices & Common Mistakesโ€‹

โœ… Best Practicesโ€‹

  • Always call df.info() right after read_csv() โ€” check dtypes, missing counts, and whether pandas guessed columns correctly before doing any analysis
  • Always use index=False in to_csv() โ€” forgetting adds an unnamed index column that corrupts the file when re-read
  • Use usecols on large files โ€” loading only the columns you need can reduce memory usage by 80% on wide datasets

โŒ Common Mistakesโ€‹

  • Forgetting index=False โ€” the saved CSV gets an extra unnamed first column (Unnamed: 0). Always check: df.to_csv('out.csv', index=False)
  • Ignoring encoding errors โ€” files from Windows often use cp1252 encoding. If you get a UnicodeDecodeError, try encoding='cp1252' or encoding='latin-1'
  • Not using .copy() after filtering โ€” df_clean = df.dropna() then df_clean['new_col'] = ... triggers SettingWithCopyWarning. Use df_clean = df.dropna().copy()

โ“ Frequently Asked Questionsโ€‹

Q: Which pandas function reads a CSV file into a DataFrame?

pd.read_csv('filename.csv') โ€” this is the standard answer for CBSE exams. The return value is a pandas DataFrame with the file's header row as column names and a default RangeIndex as the row index.

Q: Why should you always use index=False when saving with to_csv()?

Without it, pandas writes the row index (0, 1, 2...) as an extra unnamed column in the CSV. When you re-read that file, you get a column named Unnamed: 0 full of row numbers โ€” junk data. The fix: df.to_csv('out.csv', index=False).

Q: What does df.isnull().sum() return?

A Series showing the count of missing (NaN) values in each column. For example, if marks has 3 missing values, df.isnull().sum()['marks'] returns 3. Compare this to the total row count (len(df)) to understand how much data is missing.

Q: What is the difference between dropna() and fillna()?

dropna() removes rows (or columns) that contain any missing value โ€” you lose data. fillna(value) replaces missing values with a specified value โ€” you keep all rows. Use dropna() when missing rows are invalid; use fillna() when you can substitute a sensible default (e.g., 0 for marks, 'Unknown' for city).

Q: How do you load only specific columns from a large CSV?

Use the usecols parameter: pd.read_csv('file.csv', usecols=['col1', 'col2']). pandas reads only those columns into memory โ€” much faster and cheaper for wide datasets with 50+ columns.


โœ… Summaryโ€‹

In this tutorial, you've learned:

  • โœ… pd.read_csv('file.csv') loads any CSV into a DataFrame in one line
  • โœ… Use usecols, nrows, and index_col to control what gets loaded
  • โœ… df.isnull().sum() reveals missing values; dropna() removes them; fillna() replaces them
  • โœ… Always df.to_csv('out.csv', index=False) โ€” never forget index=False
  • โœ… Run df.info() immediately after reading โ€” missing values change column dtypes from int64 to float64

๐Ÿ’ก Interview & Exam Tipsโ€‹

Q: Which pandas function reads a CSV file into a DataFrame?

pd.read_csv('filename.csv') โ€” standard CBSE answer. Returns a DataFrame with header row as column names.

Q: Why should you always use index=False when saving with to_csv()?

Without it, pandas writes the row index as an extra unnamed column, creating junk data when the file is re-read.

Q: What does df.isnull().sum() return?

A Series showing the count of missing (NaN) values in each column.

Q: What is the difference between dropna() and fillna()?

dropna() removes rows with missing values. fillna() replaces missing values with a specified value.


๐Ÿ“š Further Readingโ€‹

Continue your learning path:

Go deeper: