DataFrame Basics
Prerequisites: pandas Series, Python Dictionaries
Mentor's Note: A DataFrame is the heart of pandas. Think of it as a Python-powered spreadsheet that can comfortably hold millions of rows, as long as they fit in memory. Once you master DataFrames, you can analyse almost any tabular data.
What You'll Learn
By the end of this tutorial, you'll know:
- What a DataFrame is: rows, columns, index, and dtypes explained clearly
- How to create a DataFrame from a dictionary, a list of lists, and a CSV file
- How to inspect a DataFrame with head(), info(), describe(), and shape
- How to add and drop columns, and the SettingWithCopyWarning trap to avoid
The Scenario: The Student Report Card
Imagine your school's term report card:
- Rows = each student (Vishnu, Ankit, Priya...)
- Columns = each subject (Maths, English, Science...)
- Each cell = one mark
That entire report card (all students, all subjects, all marks) is one pandas DataFrame. You can sort it, filter it, summarise it, and export it in seconds.
Concept Explanation
1. What is a DataFrame?
A DataFrame is a two-dimensional, labelled data structure, like a table with:
- Index: row labels (default: 0, 1, 2... or custom)
- Columns: column labels (strings, like 'name', 'marks')
- Values: the actual data in each cell
Each column in a DataFrame is a Series. So a DataFrame is a collection of Series that share the same index.
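To make the "collection of Series" idea concrete, here is a small sketch using the report-card data from this tutorial. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Vishnu', 'Ankit', 'Priya'],
    'marks': [95, 82, 70],
})

# Selecting one column by its label gives back a Series
marks = df['marks']
print(type(marks).__name__)            # Series

# The column carries the same row labels (index) as the DataFrame
print(marks.index.equals(df.index))    # True

# Every column can be pulled out the same way
for column_name in df.columns:
    print(column_name, type(df[column_name]).__name__)
```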
2. Data Types (dtypes)
Each column can have a different dtype:
| dtype | Python equivalent | Example values |
|---|---|---|
| int64 | int | 95, 82, 0 |
| float64 | float | 3.14, 82.5 |
| object | str | 'Vishnu', 'Surat' |
| bool | bool | True, False |
| datetime64 | datetime | 2024-01-15 |
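As a rough illustration of the table above, the sketch below builds one column per dtype and prints what pandas inferred. The column names are made up for the example, and the exact dtype reported for text columns can differ between pandas versions (some report a dedicated string dtype instead of object), so the checks use helper functions rather than dtype names.

```python
import pandas as pd

df = pd.DataFrame({
    'marks': [95, 82, 0],                    # whole numbers
    'average': [3.14, 82.5, 70.0],           # decimal numbers
    'name': ['Vishnu', 'Ankit', 'Priya'],    # text
    'passed': [True, True, False],           # booleans
    'exam_date': pd.to_datetime(['2024-01-15', '2024-01-16', '2024-01-17']),
})

# One dtype per column
print(df.dtypes)

# Helper functions answer "what kind of column is this?" without
# depending on the exact dtype name
print(pd.api.types.is_integer_dtype(df['marks']))              # True
print(pd.api.types.is_float_dtype(df['average']))              # True
print(pd.api.types.is_bool_dtype(df['passed']))                # True
print(pd.api.types.is_datetime64_any_dtype(df['exam_date']))   # True
```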
3. Key Attributes
| Attribute | What it returns |
|---|---|
| df.shape | Tuple (rows, columns) |
| df.columns | Column names |
| df.index | Row labels |
| df.dtypes | dtype of each column |
| df.size | Total number of cells |
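The attributes in this table can be tried on a small DataFrame. This is a minimal sketch with the tutorial's student data; note that these are attributes, so there are no parentheses after them.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Vishnu', 'Ankit', 'Priya', 'Sara', 'Raj'],
    'marks': [95, 82, 70, 88, 60],
    'grade': ['A', 'B', 'C', 'A', 'D'],
})

rows, cols = df.shape        # shape is a tuple, so it can be unpacked
print(df.shape)              # (5, 3)
print(df.size)               # 15, i.e. rows * cols
print(list(df.columns))      # ['name', 'marks', 'grade']
print(list(df.index))        # [0, 1, 2, 3, 4]
print(len(df))               # 5, the number of rows
```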
Visual Logic
graph TD
    A["Python Dictionary / List of Lists"] --> B["pd.DataFrame()"]
    B --> C["DataFrame\n(rows × columns table)"]
    C --> D["df.index\n(row labels)"]
    C --> E["df.columns\n(column names)"]
    C --> F["df.dtypes\n(type per column)"]
Implementation
import pandas as pd

# Each key = column name, value = list of column values
data = {
    'name': ['Vishnu', 'Ankit', 'Priya'],
    'marks': [95, 82, 70],
    'grade': ['A', 'B', 'C']
}
df = pd.DataFrame(data)
print(df)
#      name  marks grade
# 0  Vishnu     95     A
# 1   Ankit     82     B
# 2   Priya     70     C

print(df.shape)   # (3, 3)
print(df.dtypes)
# name     object
# marks     int64
# grade    object

import pandas as pd

data = {
    'name': ['Vishnu', 'Ankit', 'Priya', 'Sara', 'Raj'],
    'marks': [95, 82, 70, 88, 60],
    'grade': ['A', 'B', 'C', 'A', 'D'],
    'passed': [True, True, True, True, False]
}
df = pd.DataFrame(data)

df.head(3)      # First 3 rows
df.tail(2)      # Last 2 rows
df.info()       # Column names, dtypes, non-null counts
df.describe()   # Statistical summary of numeric columns
df.shape        # (5, 4)
df.columns      # Index(['name', 'marks', 'grade', 'passed'])
df.index        # RangeIndex(start=0, stop=5, step=1)
df.dtypes       # dtype for each column

import pandas as pd

df = pd.DataFrame({
    'name': ['Vishnu', 'Ankit', 'Priya'],
    'marks': [95, 82, 70]
})

# Add a new column: vectorised calculation
df['bonus'] = df['marks'] * 0.1
print(df)
# Output:
#      name  marks  bonus
# 0  Vishnu     95    9.5
# 1   Ankit     82    8.2
# 2   Priya     70    7.0

# Drop a column (returns new DataFrame; original unchanged)
df_clean = df.drop(columns=['bonus'])
print(df_clean)
# Output:
#      name  marks
# 0  Vishnu     95
# 1   Ankit     82
# 2   Priya     70
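The learning goals also mention building a DataFrame from a list of lists and from a CSV file. A minimal sketch of both follows. The CSV text is held in memory with io.StringIO so the example runs without a file on disk; that is an assumption made for this demo, and in practice you would pass a file path (for instance a hypothetical students.csv) to pd.read_csv().

```python
import io
import pandas as pd

# 1) From a list of lists: each inner list is one row,
#    and columns= supplies the column names
rows = [
    ['Vishnu', 95, 'A'],
    ['Ankit', 82, 'B'],
    ['Priya', 70, 'C'],
]
df_from_rows = pd.DataFrame(rows, columns=['name', 'marks', 'grade'])
print(df_from_rows)

# 2) From CSV: the first line is read as the header row.
#    StringIO stands in for a real file here (an assumption for this demo).
csv_text = "name,marks,grade\nVishnu,95,A\nAnkit,82,B\nPriya,70,C\n"
df_from_csv = pd.read_csv(io.StringIO(csv_text))
print(df_from_csv)

# Both routes give the same column names and the same shape
print(list(df_from_rows.columns) == list(df_from_csv.columns))   # True
print(df_from_rows.shape, df_from_csv.shape)                     # (3, 3) (3, 3)
```

Without columns=, the list-of-lists version would label the columns 0, 1, 2, which is why the names are passed explicitly.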
Open a terminal, type python3, and explore DataFrames line by line.
>>> import pandas as pd
>>> df = pd.DataFrame({'name': ['Vishnu', 'Ankit', 'Priya'], 'marks': [95, 82, 70]})
>>> df
     name  marks
0  Vishnu     95
1   Ankit     82
2   Priya     70
>>> df.shape
(3, 2)
>>> df.dtypes
name     object
marks     int64
dtype: object
>>> df['marks'].mean()
82.33333333333333
>>> df[df['marks'] > 80]
     name  marks
0  Vishnu     95
1   Ankit     82
>>> df['bonus'] = df['marks'] * 0.1
>>> df
     name  marks  bonus
0  Vishnu     95    9.5
1   Ankit     82    8.2
2   Priya     70    7.0
New to the REPL?
Type python3 in your terminal. Each >>> is what you type; the line below is Python's response. Type exit() to quit.
Sample Dry Run: df.info() explained
DataFrame with 4 columns: name (object), marks (int64), grade (object), passed (bool)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4             ← 5 rows, index from 0 to 4
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   name    5 non-null      object       ← All 5 rows have a name (no missing)
 1   marks   5 non-null      int64        ← All 5 rows have marks
 2   grade   5 non-null      object
 3   passed  5 non-null      bool
dtypes: bool(1), int64(1), object(2)
memory usage: 293.0+ bytes
| Line | What it tells you |
|---|---|
| RangeIndex: 5 entries | There are 5 rows |
| Non-Null Count | How many values are not missing (compare to total rows to spot gaps) |
| Dtype | The data type of each column |
| memory usage | How much RAM the DataFrame uses |
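The dry run above has no gaps, so every count equals the row total. As a hedged illustration of what a gap looks like, the sketch below uses made-up data with one missing mark and compares the non-null counts against the number of rows.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Vishnu', 'Ankit', 'Priya', 'Sara', 'Raj'],
    'marks': [95, 82, None, 88, 60],   # Priya's mark is missing (invented for this demo)
})

df.info()   # the marks line now reads "4 non-null" against 5 entries

# The same check in numbers
total_rows = len(df)
non_null = df.count()          # non-null values per column
missing = df.isna().sum()      # missing values per column
print(total_rows)              # 5
print(int(non_null['marks']))  # 4
print(int(missing['marks']))   # 1

# A missing value also turns the integer column into float64
print(df['marks'].dtype)       # float64
```

The dtype change is worth noticing: an unexpected float64 in df.info() is often the first hint that a column of whole numbers contains missing values.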
Practice Lab
Task: Employee DataFrame
Create a DataFrame of 4 employees with these columns: name, department, salary.
Then:
- Print its shape and dtypes.
- Add a bonus column equal to 10% of salary.
- Use df.describe() and note what it tells you about the salary column.
- Drop the bonus column and print the result.
Hint: Use a dictionary to build the DataFrame. Salary can be integers (e.g., 50000).
Best Practices & Common Mistakes
Best Practices
- Always call df.info() after loading data: it shows you missing values, wrong dtypes, and unexpected column names before you do any analysis
- Use .copy() when deriving a new DataFrame: df2 = df[df['marks'] > 80].copy() prevents the SettingWithCopyWarning when you later modify df2
- Name your columns clearly: salary_inr is better than s; you'll thank yourself when reading the code three days later
Common Mistakes
- Modifying a slice without .copy(): df2 = df[df['marks'] > 80] then df2['bonus'] = 5 can trigger SettingWithCopyWarning (a warning, not an exception). Use .copy() on derived DataFrames you intend to modify
- Confusing df.shape and df.size: shape gives (rows, cols) as a tuple; size gives the total cell count (rows × cols). CBSE questions test this distinction
- Calling df.describe() and expecting string columns: describe() only summarises numeric columns by default, so in a mixed DataFrame the string columns are silently skipped
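If you do want a summary of the text columns, describe() accepts an include parameter. A short sketch on the tutorial's student data:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Vishnu', 'Ankit', 'Priya', 'Sara', 'Raj'],
    'marks': [95, 82, 70, 88, 60],
    'grade': ['A', 'B', 'C', 'A', 'D'],
})

# Default: only the numeric column appears
numeric_summary = df.describe()
print(list(numeric_summary.columns))         # ['marks']

# include='all' summarises every column; statistics that do not apply
# to a column are shown as NaN
full_summary = df.describe(include='all')
print(list(full_summary.columns))            # ['name', 'marks', 'grade']

# Text columns report count, unique, top (most frequent value) and freq
print(full_summary.loc['unique', 'grade'])   # 4
print(full_summary.loc['top', 'grade'])      # A
print(full_summary.loc['freq', 'grade'])     # 2
```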
Frequently Asked Questions
Q: What's the difference between df.info() and df.describe()?
df.info() shows the structure: column names, how many non-null values each has, and the dtype. df.describe() shows statistics: count, mean, std, min, 25th/50th/75th percentile, and max, but only for numeric columns by default. Use info() first to understand your data, then describe() to understand the numbers.
Q: Why does df.drop(columns=['bonus']) not change the original DataFrame?
By default, drop() returns a new DataFrame and leaves the original unchanged. To modify in place, you can use df.drop(columns=['bonus'], inplace=True), but the more common recommendation is to avoid inplace=True and reassign instead: df = df.drop(columns=['bonus']).
Q: What is SettingWithCopyWarning and how do I fix it?
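It appears when you assign to a DataFrame that was derived from another one and pandas cannot tell whether it is a view (a window into the original) or an independent copy, so the change might not land where you expect. Fix: call .copy() when creating a derived DataFrame you plan to modify, for example df2 = df[df['marks'] > 80].copy(). Then modifying df2 is safe. Newer pandas versions that use Copy-on-Write handle this case differently and may not show the warning at all; the explicit .copy() still makes your intent clear.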
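A minimal sketch of the .copy() fix described above. Whether the warning actually appears without .copy() depends on your pandas version and settings, so the example only demonstrates the safe pattern and checks that the original DataFrame is untouched. The name toppers is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Vishnu', 'Ankit', 'Priya'],
    'marks': [95, 82, 70],
})

# Filter, then take an explicit, independent copy
toppers = df[df['marks'] > 80].copy()

# Modifying the copy is safe and unambiguous
toppers['bonus'] = 5
print(list(toppers.columns))   # ['name', 'marks', 'bonus']
print(len(toppers))            # 2

# The original DataFrame is unchanged
print(list(df.columns))        # ['name', 'marks']
```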
Q: (CBSE exam) What does df.shape return?
A tuple (number_of_rows, number_of_columns). For example, a DataFrame with 5 students and 3 columns returns (5, 3).
Summary
In this tutorial, you've learned:
- A DataFrame is a 2D labelled table: rows have an index, columns have names, each cell has a value
- Create DataFrames from a dictionary (pd.DataFrame(data)) or a list of lists (pass columns= too)
- Inspect with head(), tail(), info(), describe(), shape, and dtypes
- Add columns with df['new_col'] = expression and drop with df.drop(columns=[...])
- Always use .copy() on derived DataFrames to avoid SettingWithCopyWarning
Interview & Exam Tips
Q: How do you check the data types of all columns in a DataFrame?
df.dtypes returns a Series where each index label is a column name and each value is that column's dtype (int64, float64, object, etc.)
Q: What is the difference between df.info() and df.describe()?
df.info() shows column names, dtypes, and non-null counts. df.describe() shows statistical summaries (mean, std, min, max) for numeric columns only.
Q: What does df.shape return?
A tuple (rows, columns). For example, (5, 3) means 5 rows and 3 columns.
Q: What is the default index of a DataFrame?
RangeIndex starting from 0: 0, 1, 2, ... You can set a custom index with df.set_index('name').
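The default index and set_index() can be seen side by side in a short sketch. The name by_name is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Vishnu', 'Ankit', 'Priya'],
    'marks': [95, 82, 70],
})

# Default index: a RangeIndex counting up from 0
print(type(df.index).__name__)   # RangeIndex
print(list(df.index))            # [0, 1, 2]

# set_index() returns a new DataFrame whose row labels come from a column
by_name = df.set_index('name')
print(list(by_name.index))       # ['Vishnu', 'Ankit', 'Priya']
print(list(by_name.columns))     # ['marks'] ('name' moved into the index)

# The original still has its default index
print(list(df.index))            # [0, 1, 2]
```

Like drop(), set_index() leaves the original DataFrame alone, so reassign the result if you want to keep it.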
Further Reading
Continue your learning path:
- Previous: pandas Series (a DataFrame is multiple Series sharing an index)
- Next: Indexing & Selection (learn .loc[] and .iloc[] to slice DataFrames precisely)
Go deeper:
- Official pandas docs, DataFrame: full method reference
- CSV with pandas: load real data from files instead of typing it manually
- Indexing & Selection: filter rows and select columns the right way