# Reading & Writing CSV Files with pandas
Prerequisites: Indexing and Selection, Python CSV Basics
**Mentor's Note:** Real-world data almost always arrives as a CSV file. `pd.read_csv()` will be the single most-used pandas function in your data science career. Learn it well; CBSE practicals test it every year.
## What You'll Learn

By the end of this tutorial, you'll know:

- How to load any CSV file into a DataFrame in one line with `pd.read_csv()`
- Which parameters matter most: `usecols`, `index_col`, `nrows`, `encoding`
- How to detect and handle missing values with `isnull()`, `dropna()`, and `fillna()`
- Why you must always use `index=False` when saving with `to_csv()`, and what goes wrong if you forget
## The Scenario: The School Data Import

Your school exports its student database as a `.csv` file every semester: 5,000 students, 10 columns, with some missing values and extra whitespace. Opening it in Excel takes 30 seconds and crashes on older machines. With pandas:
```python
import pandas as pd

df = pd.read_csv('students.csv')
print(df.shape)   # (5000, 10), instantly
print(df.head())  # First 5 rows
```
That's it. The DataFrame is ready to filter, sort, clean, and export.
**See Also: Built-in CSV Module**

This page covers pandas CSV handling for data analysis. If you need the built-in Python `csv` module (no third-party library required), see Reading & Writing CSV Files (built-in `csv` module).
## Concept Explanation

### 1. `pd.read_csv()`: Key Parameters

| Parameter | Purpose | Example |
|---|---|---|
| `filepath_or_buffer` | Path to CSV file | `'students.csv'` |
| `sep` | Delimiter character | `sep=','` (default) or `sep='\t'` for TSV |
| `header` | Row number(s) to use for column names | `header=0` (default) |
| `index_col` | Column to use as row index | `index_col='name'` |
| `usecols` | Load only specific columns | `usecols=['name', 'marks']` |
| `nrows` | Load only the first N rows | `nrows=100` |
| `encoding` | File encoding | `encoding='utf-8'` |
| `na_values` | Extra values to treat as missing | `na_values=['N/A', '-']` |
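These parameters compose naturally. Here's a quick sketch combining `sep`, `na_values`, and `usecols`; the student data is made up, and `io.StringIO` stands in for a file path so the example runs without a file on disk (a real path works identically):

```python
import io
import pandas as pd

# Tab-separated data in which 'N/A' and '-' mark missing values
tsv_data = "name\tmarks\tcity\nVishnu\t95\tSurat\nAnkit\tN/A\tMumbai\nPriya\t70\t-"

df = pd.read_csv(
    io.StringIO(tsv_data),      # any file path works the same way
    sep='\t',                   # tab delimiter instead of the default comma
    na_values=['N/A', '-'],     # treat these strings as NaN
    usecols=['name', 'marks'],  # load only two of the three columns
)
print(df)                # 3 rows, 2 columns; Ankit's marks is NaN
print(df.isnull().sum())
```

Note that `usecols` means the `city` column is never loaded at all, and `na_values` converts the `'N/A'` marker into a proper `NaN` during parsing.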
### 2. `df.to_csv()`: Key Parameters

| Parameter | Purpose | Example |
|---|---|---|
| `path_or_buf` | Output file path | `'output.csv'` |
| `sep` | Delimiter | `sep=','` (default) |
| `index` | Write row index to file | Always use `index=False` |
| `encoding` | Output encoding | `encoding='utf-8'` |
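To see why `index=False` matters, here's a small round-trip sketch using an in-memory buffer instead of a real file (the column names are illustrative):

```python
import io
import pandas as pd

df = pd.DataFrame({'name': ['Vishnu', 'Ankit'], 'marks': [95, 82]})

# Round-trip WITHOUT index=False: the index sneaks in as 'Unnamed: 0'
buf = io.StringIO()
df.to_csv(buf)                  # index is written by default
buf.seek(0)
bad = pd.read_csv(buf)
print(bad.columns.tolist())     # ['Unnamed: 0', 'name', 'marks']

# Round-trip WITH index=False: clean columns
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
good = pd.read_csv(buf)
print(good.columns.tolist())    # ['name', 'marks']
```

The junk `Unnamed: 0` column appears because the default index is written with an empty header, which `read_csv` then has to name on re-read.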
### 3. Handling Missing Data After Reading

| Method | Purpose |
|---|---|
| `df.isnull().sum()` | Count missing values per column |
| `df.dropna()` | Remove rows with any missing value |
| `df.fillna(value)` | Replace missing values with `value` |
## Visual Logic

```mermaid
graph LR
    A["students.csv"] --> B["pd.read_csv()"]
    B --> C["DataFrame"]
    C --> D["Clean / Analyse"]
    D --> E["df.to_csv()"]
    E --> F["cleaned_students.csv"]
```
## Implementation

```python
import pandas as pd

# Load only specific columns -- faster on large files
df = pd.read_csv(
    'students.csv',
    usecols=['name', 'marks', 'city'],
    nrows=100,          # Load only the first 100 rows
    encoding='utf-8'    # Specify encoding
)

# Use a column as the row index
df2 = pd.read_csv('students.csv', index_col='roll_no')
print(df2.loc[101])  # Access row by roll number
# Output:
# name     Vishnu
# marks        95
# city      Surat
# Name: 101, dtype: object
```
```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Vishnu', 'Ankit', 'Priya'],
    'marks': [95, 82, 70],
    'grade': ['A', 'B', 'C']
})

# Save to CSV -- always use index=False!
df.to_csv('results.csv', index=False)

# Without index=False, pandas writes an extra unnamed column:
#   ,name,marks,grade    <- unwanted!
# With index=False:
#   name,marks,grade     <- clean!
```
```python
import pandas as pd

df = pd.read_csv('students.csv')

# Check how many values are missing per column
print(df.isnull().sum())
# Output:
# name     0
# marks    3    <- 3 students have no marks recorded
# city     7
# dtype: int64

# Option 1: Drop rows with ANY missing value
df_clean = df.dropna()

# Option 2: Fill missing marks with 0 and missing cities with 'Unknown'
df_filled = df.fillna({'marks': 0, 'city': 'Unknown'})

# Save the cleaned version
df_clean.to_csv('students_cleaned.csv', index=False)
```
Build a small in-memory CSV and round-trip it; no file needed on disk.

```python
>>> import pandas as pd
>>> import io
>>> csv_data = """name,marks,city
... Vishnu,95,Surat
... Ankit,,Mumbai
... Priya,70,
... Sara,88,Delhi"""
>>> df = pd.read_csv(io.StringIO(csv_data))
>>> df
     name  marks    city
0  Vishnu   95.0   Surat
1   Ankit    NaN  Mumbai
2   Priya   70.0     NaN
3    Sara   88.0   Delhi
>>> df.isnull().sum()
name     0
marks    1
city     1
dtype: int64
>>> df.fillna({'marks': 0, 'city': 'Unknown'})
     name  marks     city
0  Vishnu   95.0    Surat
1   Ankit    0.0   Mumbai
2   Priya   70.0  Unknown
3    Sara   88.0    Delhi
>>> df.dropna()
     name  marks   city
0  Vishnu   95.0  Surat
3    Sara   88.0  Delhi
```
**New to the REPL?** Type `python3` in your terminal. `io.StringIO` lets you simulate a CSV file from a string, perfect for experimenting without needing a real file on disk. Type `exit()` to quit.
## Comparison: Built-in `csv` Module vs `pd.read_csv`

| Feature | `csv.reader` | `pd.read_csv` |
|---|---|---|
| Returns | List of rows | DataFrame |
| Column access | By position: `row[0]` | By name: `df['name']` |
| Missing data handling | Manual | Built-in |
| Filter rows | Manual loop | `df[condition]` |
| Statistical analysis | Manual | `df.describe()` |
| Large files | Slow | Optimised (C backend) |
| Requires install | No (built-in) | Yes (`pip install pandas`) |
**When to use which?**

- Use the `csv` module for simple scripts, small files, or when you cannot install third-party packages.
- Use pandas for data analysis, cleaning, filtering, or any file with more than a few hundred rows.
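To make the contrast concrete, here's the same small task done both ways: computing the average of a marks column, sketched with in-memory data so no file is needed:

```python
import csv
import io

import pandas as pd

data = "name,marks\nVishnu,95\nAnkit,82\nPriya,70"

# Built-in csv module: manual positional access and type conversion
reader = csv.reader(io.StringIO(data))
next(reader)                             # skip the header row
marks = [int(row[1]) for row in reader]  # column by position, cast by hand
print(sum(marks) / len(marks))           # about 82.33

# pandas: column access by name, aggregation built in
df = pd.read_csv(io.StringIO(data))
print(df['marks'].mean())                # about 82.33
```

With three rows the difference is cosmetic; with thousands of rows and a dozen columns, the manual bookkeeping on the `csv.reader` side grows quickly while the pandas side stays one line per operation.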
## Sample Dry Run: What Happens When You Call `pd.read_csv()`?

File `students.csv` (3 data rows plus a header row):

| Step | What pandas does | Result |
|---|---|---|
| 1 | Reads header row | Column names: `name`, `marks`, `city` |
| 2 | Reads row 1 | Vishnu, 95, Surat (all values present) |
| 3 | Reads row 2 | Ankit, (empty), Mumbai; `marks` becomes `NaN` |
| 4 | Reads row 3 | Priya, 70, (empty); `city` becomes `NaN` |
| 5 | Assigns default index | `RangeIndex(0, 1, 2)` |
| 6 | Returns DataFrame | shape `(3, 3)`; `marks` dtype `float64` (because of the `NaN`) |

**Notice:** a single missing value forces the `marks` column from `int64` to `float64`, because `NaN` is a float in Python. Always run `df.info()` after reading to check this.
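You can watch this dtype promotion happen directly. The sketch below also shows one possible fix, pandas' nullable `'Int64'` dtype (note the capital I), which stores whole numbers while still allowing missing values; the conversion from a float column is supported in recent pandas versions:

```python
import io
import pandas as pd

csv_data = "name,marks\nVishnu,95\nAnkit,\nPriya,70"
df = pd.read_csv(io.StringIO(csv_data))

# The empty cell becomes NaN; NaN is a float, so the whole
# column is promoted from int64 to float64
print(df['marks'].dtype)       # float64

# The nullable integer dtype keeps whole numbers and shows
# missing entries as <NA> instead of NaN
df['marks'] = df['marks'].astype('Int64')
print(df['marks'].dtype)       # Int64
print(df)
```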
## Practice Lab

**Task: Real CSV Analysis**

Download any free CSV dataset (e.g., from data.gov.in or a sample students CSV). Then:

- Load it with `pd.read_csv()` and print its `shape`.
- Use `df.isnull().sum()` to check for missing values.
- Drop rows with missing values using `df.dropna()`.
- Filter rows based on a numeric condition (e.g., marks > 50).
- Save the cleaned DataFrame with `df.to_csv('cleaned.csv', index=False)`.

**Bonus:** Try loading only 2 specific columns using `usecols`.
## Best Practices & Common Mistakes

### Best Practices

- Always call `df.info()` right after `read_csv()`: check dtypes, missing counts, and whether pandas parsed the columns correctly before doing any analysis.
- Always use `index=False` in `to_csv()`: forgetting it adds an unnamed index column that pollutes the file when it is re-read.
- Use `usecols` on large files: loading only the columns you need can cut memory usage dramatically on wide datasets.
### Common Mistakes

- Forgetting `index=False`: the saved CSV gets an extra unnamed first column (`Unnamed: 0`). Always check: `df.to_csv('out.csv', index=False)`.
- Ignoring encoding errors: files from Windows often use `cp1252` encoding. If you get a `UnicodeDecodeError`, try `encoding='cp1252'` or `encoding='latin-1'`.
- Not using `.copy()` after filtering: `df_clean = df.dropna()` followed by `df_clean['new_col'] = ...` can trigger `SettingWithCopyWarning`. Use `df_clean = df.dropna().copy()`.
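The third mistake is easy to reproduce. Here's a minimal sketch of the safe pattern (the column names and data are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Vishnu', 'Ankit', 'Priya'],
                   'marks': [95.0, None, 70.0]})

# Risky: assigning a new column to the raw result of dropna()
# may raise SettingWithCopyWarning:
#   df_clean = df.dropna()
#   df_clean['grade'] = ...   # may warn

# Safe: take an explicit copy before adding columns
df_clean = df.dropna().copy()
df_clean['grade'] = ['A', 'B']   # two rows survive dropna()
print(df_clean)
```

The `.copy()` makes it unambiguous to pandas that `df_clean` is an independent DataFrame, so writing to it can never be mistaken for a write-through to `df`.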
## Frequently Asked Questions

**Q: Which pandas function reads a CSV file into a DataFrame?**

`pd.read_csv('filename.csv')`. This is the standard answer for CBSE exams. The return value is a pandas DataFrame with the file's header row as column names and a default `RangeIndex` as the row index.

**Q: Why should you always use `index=False` when saving with `to_csv()`?**

Without it, pandas writes the row index (0, 1, 2, ...) as an extra unnamed column in the CSV. When you re-read that file, you get a column named `Unnamed: 0` full of row numbers: junk data. The fix: `df.to_csv('out.csv', index=False)`.

**Q: What does `df.isnull().sum()` return?**

A Series showing the count of missing (`NaN`) values in each column. For example, if `marks` has 3 missing values, `df.isnull().sum()['marks']` returns 3. Compare this to the total row count (`len(df)`) to understand how much data is missing.

**Q: What is the difference between `dropna()` and `fillna()`?**

`dropna()` removes rows (or columns) that contain any missing value, so you lose data. `fillna(value)` replaces missing values with a specified value, so you keep all rows. Use `dropna()` when missing rows are invalid; use `fillna()` when you can substitute a sensible default (e.g., 0 for marks, `'Unknown'` for city).

**Q: How do you load only specific columns from a large CSV?**

Use the `usecols` parameter: `pd.read_csv('file.csv', usecols=['col1', 'col2'])`. pandas reads only those columns into memory, which is much faster and cheaper for wide datasets with 50+ columns.
## Summary

In this tutorial, you've learned:

- `pd.read_csv('file.csv')` loads any CSV into a DataFrame in one line
- `usecols`, `nrows`, and `index_col` control what gets loaded
- `df.isnull().sum()` reveals missing values; `dropna()` removes them; `fillna()` replaces them
- Always `df.to_csv('out.csv', index=False)`; never forget `index=False`
- Run `df.info()` immediately after reading; missing values change column dtypes from `int64` to `float64`
## Interview & Exam Tips

**Q: Which pandas function reads a CSV file into a DataFrame?**
`pd.read_csv('filename.csv')`. Standard CBSE answer. Returns a DataFrame with the header row as column names.

**Q: Why should you always use `index=False` when saving with `to_csv()`?**
Without it, pandas writes the row index as an extra unnamed column, creating junk data when the file is re-read.

**Q: What does `df.isnull().sum()` return?**
A Series showing the count of missing (`NaN`) values in each column.

**Q: What is the difference between `dropna()` and `fillna()`?**
`dropna()` removes rows with missing values; `fillna()` replaces missing values with a specified value.
## Further Reading

Continue your learning path:

- Indexing & Selection: apply `loc`, `iloc`, and filtering on real CSV data
- Data Science Roadmap: overview of the full pandas learning path

Go deeper:

- Official pandas docs, IO tools: the full reference for all read/write functions
- DataFrame Basics: revisit `df.info()` and `df.describe()` to understand your loaded data
- Indexing & Selection: filter and slice your CSV data after loading