DataFrame Basics
Mentor's Note: A DataFrame is the heart of pandas. Think of it as a Python-powered spreadsheet that can comfortably hold millions of rows, as long as the data fits in memory. Once you are comfortable with DataFrames, most tabular analysis in pandas builds on them.
By the end of this tutorial, you'll know:
- What a DataFrame is: rows, columns, index, and dtypes explained clearly
- How to create a DataFrame from a dictionary, a list of lists, and a CSV file
- How to inspect a DataFrame with `head()`, `info()`, `describe()`, and `shape`
- How to add and drop columns, and the `SettingWithCopyWarning` trap to avoid
The Scenario: The Student Report Card
Imagine your school's term report card:
- Rows = each student (Vishnu, Ankit, Priya...)
- Columns = each subject (Maths, English, Science...)
- Each cell = one mark
That entire report card (all students, all subjects, all marks) is one pandas DataFrame. You can sort it, filter it, summarise it, and export it in seconds.
Concept Explanation
1. What is a DataFrame?
A DataFrame is a two-dimensional, labelled data structure, like a table with:
- Index: row labels (default: 0, 1, 2... or custom)
- Columns: column labels (strings, like `'name'`, `'marks'`)
- Values: the actual data in each cell
Each column in a DataFrame is a Series. So a DataFrame is a collection of Series that share the same index.
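To make the "collection of Series" idea concrete, here is a minimal sketch that reuses the report-card data from this tutorial. It selects one column and checks that the result is a Series carrying the same index as its parent DataFrame. The variable name `marks` is chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Vishnu', 'Ankit', 'Priya'],
    'marks': [95, 82, 70],
})

marks = df['marks']                   # selecting one column gives a Series

print(type(marks).__name__)           # Series
print(marks.index.equals(df.index))   # True: same row labels as the DataFrame
print(marks.name)                     # marks (the column label becomes the Series name)
```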
2. Data Types (dtypes)
Each column can have a different dtype:
| dtype | Python equivalent | Example values |
|---|---|---|
| `int64` | `int` | 95, 82, 0 |
| `float64` | `float` | 3.14, 82.5 |
| `object` | `str` | 'Vishnu', 'Surat' |
| `bool` | `bool` | True, False |
| `datetime64` | `datetime` | 2024-01-15 |
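The table can be checked against a small DataFrame. The sketch below builds one column per dtype (the column names `average` and `exam_date` are made up for illustration) and then converts a column with `astype()`. The exact label printed for the text column can differ between pandas versions, so no expected output is shown for `df.dtypes`.

```python
import pandas as pd

df = pd.DataFrame({
    'marks': [95, 82, 0],                    # whole numbers
    'average': [3.14, 82.5, 70.0],           # decimals
    'name': ['Vishnu', 'Ankit', 'Priya'],    # text
    'passed': [True, True, False],           # booleans
    'exam_date': pd.to_datetime(['2024-01-15', '2024-01-16', '2024-01-17']),
})

print(df.dtypes)            # one dtype per column

# Convert a column to another dtype with astype()
df['marks'] = df['marks'].astype('float64')
print(df['marks'].dtype)    # float64
```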
3. Key Attributes
| Attribute | What it returns |
|---|---|
| `df.shape` | Tuple (rows, columns) |
| `df.columns` | Column names |
| `df.index` | Row labels |
| `df.dtypes` | dtype of each column |
| `df.size` | Total number of cells |
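As a quick illustration of these attributes, the sketch below uses a 5-row, 3-column version of the report card. Note that they are attributes, not methods, so there are no parentheses after `shape` or `size`.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Vishnu', 'Ankit', 'Priya', 'Sara', 'Raj'],
    'marks': [95, 82, 70, 88, 60],
    'grade': ['A', 'B', 'C', 'A', 'D'],
})

rows, cols = df.shape        # shape is a tuple, so it can be unpacked
print(df.shape)              # (5, 3)
print(df.size)               # 15, i.e. rows * cols
print(list(df.columns))      # ['name', 'marks', 'grade']
print(list(df.index))        # [0, 1, 2, 3, 4]
```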
Implementation
- 1. From a Dictionary
- 2. From a List of Lists
- 3. Inspection Methods
- 4. Add & Drop Columns
- 5. Interactive REPL
1. From a Dictionary
import pandas as pd

# Each key = column name, value = list of column values
data = {
    'name': ['Vishnu', 'Ankit', 'Priya'],
    'marks': [95, 82, 70],
    'grade': ['A', 'B', 'C']
}
df = pd.DataFrame(data)
print(df)
#      name  marks grade
# 0  Vishnu     95     A
# 1   Ankit     82     B
# 2   Priya     70     C

print(df.shape)   # (3, 3)
print(df.dtypes)
# name     object
# marks     int64
# grade    object
2. From a List of Lists
import pandas as pd

# Each inner list = one row; column names are passed separately
rows = [
    ['Vishnu', 95, 'A'],
    ['Ankit', 82, 'B'],
    ['Priya', 70, 'C']
]
df = pd.DataFrame(rows, columns=['name', 'marks', 'grade'])
print(df)
#      name  marks grade
# 0  Vishnu     95     A
# 1   Ankit     82     B
# 2   Priya     70     C
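The learning goals also mention CSV files. As a rough sketch, `pd.read_csv()` builds a DataFrame from CSV text. Here an in-memory `io.StringIO` buffer stands in for a file so the example runs without anything on disk. The CSV content and the `students.csv` file name in the comment are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would live in a file
csv_text = """name,marks,grade
Vishnu,95,A
Ankit,82,B
Priya,70,C
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))
# With a real file (hypothetical name): df = pd.read_csv('students.csv')

print(df.shape)             # (3, 3)
print(list(df.columns))     # ['name', 'marks', 'grade']
print(df['marks'].sum())    # 247
```

The first line of the CSV is used as the column names by default, which is why no `columns=` argument is needed here.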
3. Inspection Methods
import pandas as pd

data = {
    'name': ['Vishnu', 'Ankit', 'Priya', 'Sara', 'Raj'],
    'marks': [95, 82, 70, 88, 60],
    'grade': ['A', 'B', 'C', 'A', 'D'],
    'passed': [True, True, True, True, False]
}
df = pd.DataFrame(data)

# In a script, wrap results in print(); a notebook or REPL shows them automatically
print(df.head(3))      # First 3 rows
print(df.tail(2))      # Last 2 rows
df.info()              # Prints column names, dtypes, non-null counts
print(df.describe())   # Statistical summary of numeric columns
print(df.shape)        # (5, 4)
print(df.columns)      # Index(['name', 'marks', 'grade', 'passed'])
print(df.index)        # RangeIndex(start=0, stop=5, step=1)
print(df.dtypes)       # dtype for each column
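The result of `describe()` is itself a DataFrame, so individual statistics can be looked up by label. The sketch below, using the same five marks, shows one way to pull out a few values; the variable name `summary` is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Vishnu', 'Ankit', 'Priya', 'Sara', 'Raj'],
    'marks': [95, 82, 70, 88, 60],
})

summary = df.describe()                   # a DataFrame of statistics
print(list(summary.columns))              # ['marks'] (name is not numeric)
print(summary.loc['mean', 'marks'])       # 79.0
print(summary.loc['min', 'marks'])        # 60.0
print(summary.loc['50%', 'marks'])        # 82.0 (the median)
print(len(df.head(3)), len(df.tail(2)))   # 3 2
```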
4. Add & Drop Columns
import pandas as pd

df = pd.DataFrame({
    'name': ['Vishnu', 'Ankit', 'Priya'],
    'marks': [95, 82, 70]
})

# Add a new column (vectorised calculation)
df['bonus'] = df['marks'] * 0.1
print(df)
# Output:
#      name  marks  bonus
# 0  Vishnu     95    9.5
# 1   Ankit     82    8.2
# 2   Priya     70    7.0

# Drop a column (returns new DataFrame; original unchanged)
df_clean = df.drop(columns=['bonus'])
print(df_clean)
# Output:
#      name  marks
# 0  Vishnu     95
# 1   Ankit     82
# 2   Priya     70
5. Interactive REPL
Open a terminal, type python3, and explore DataFrames line by line.
>>> import pandas as pd
>>> df = pd.DataFrame({'name': ['Vishnu', 'Ankit', 'Priya'], 'marks': [95, 82, 70]})
>>> df
     name  marks
0  Vishnu     95
1   Ankit     82
2   Priya     70
>>> df.shape
(3, 2)
>>> df.dtypes
name     object
marks     int64
dtype: object
>>> df['marks'].mean()
82.33333333333333
>>> df[df['marks'] > 80]
     name  marks
0  Vishnu     95
1   Ankit     82
>>> df['bonus'] = df['marks'] * 0.1
>>> df
     name  marks  bonus
0  Vishnu     95    9.5
1   Ankit     82    8.2
2   Priya     70    7.0
Type python3 in your terminal. Each >>> is what you type; the line below is Python's response. Type exit() to quit.
Sample Dry Run: df.info() explained
DataFrame with 4 columns: name (object), marks (int64), grade (object), passed (bool)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4          <- 5 rows, index from 0 to 4
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   name    5 non-null      object    <- All 5 rows have a name (no missing)
 1   marks   5 non-null      int64     <- All 5 rows have marks
 2   grade   5 non-null      object
 3   passed  5 non-null      bool
dtypes: bool(1), int64(1), object(2)
memory usage: 293.0+ bytes
| Line | What it tells you |
|---|---|
| `RangeIndex: 5 entries` | There are 5 rows |
| `Non-Null Count` | How many values are not missing (compare to total rows to spot gaps) |
| `Dtype` | The data type of each column |
| `memory usage` | How much RAM the DataFrame uses |
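The "compare Non-Null Count to total rows" check can also be done in code. The sketch below is one possible way, using a report card where two marks are deliberately missing (the data is made up for illustration). It counts non-missing values with `notna()` and subtracts from the row count.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Vishnu', 'Ankit', 'Priya', 'Sara', 'Raj'],
    'marks': [95, None, 70, 88, None],   # two marks deliberately missing
})

non_null = df.notna().sum()    # the same counts that info() lists per column
missing = len(df) - non_null   # total rows minus non-null values

print(non_null['marks'])       # 3
print(missing['marks'])        # 2
print(missing['name'])         # 0
print(df['marks'].dtype)       # float64, because missing values are stored as NaN
```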
Practice Lab
Create a DataFrame of 4 employees with these columns: name, department, salary.
Then:
- Print its `shape` and `dtypes`.
- Add a `bonus` column equal to 10% of salary.
- Use `df.describe()` and note what it tells you about the salary column.
- Drop the `bonus` column and print the result.
Hint: Use a dictionary to build the DataFrame. Salary can be integers (e.g., 50000).
Best Practices & Common Mistakes
Best Practices
- Always call `df.info()` after loading data: it shows you missing values, wrong dtypes, and unexpected column names before you do any analysis
- Use `.copy()` when deriving a new DataFrame: `df2 = df[df['marks'] > 80].copy()` prevents the `SettingWithCopyWarning` when you later modify `df2`
- Name your columns clearly: `salary_inr` is better than `s`; you'll thank yourself when reading the code three days later
Common Mistakes
- Modifying a slice without `.copy()`: `df2 = df[df['marks'] > 80]` then `df2['bonus'] = 5` can trigger `SettingWithCopyWarning` (a warning, not an exception, so the code keeps running). Always use `.copy()` on derived DataFrames
- Confusing `df.shape` and `df.size`: `shape` gives `(rows, cols)` as a tuple; `size` gives total cell count (rows × cols). CBSE questions test this distinction
- Calling `df.describe()` expecting string columns: `describe()` only summarises numeric columns by default, and string columns are silently skipped. Pass `include='all'` to summarise them too
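The `.copy()` advice can be sketched as follows. The example filters the report card, takes an explicit copy, and adds a column to the copy only; the variable name `top` is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Vishnu', 'Ankit', 'Priya'],
    'marks': [95, 82, 70],
})

# Take an explicit copy of the filtered rows before modifying them
top = df[df['marks'] > 80].copy()
top['bonus'] = 5

print(list(top.columns))   # ['name', 'marks', 'bonus']
print(list(df.columns))    # ['name', 'marks'] (original untouched)
print(len(top))            # 2
```

Because `top` is an independent copy, it is unambiguous that the new column belongs to `top` and not to `df`.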
Frequently Asked Questions
Q: What's the difference between df.info() and df.describe()?
`df.info()` shows the structure: column names, how many non-null values each has, and the dtype. `df.describe()` shows statistics: mean, std, min, 25th/50th/75th percentile, max, but only for numeric columns. Use `info()` first to understand your data, then `describe()` to understand the numbers.
Q: Why does df.drop(columns=['bonus']) not change the original DataFrame?
By default, `drop()` returns a new DataFrame and leaves the original unchanged. To modify in place, you can use `df.drop(columns=['bonus'], inplace=True)`, but reassigning is generally considered the clearer style: `df = df.drop(columns=['bonus'])`.
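A short sketch of this behaviour, using the bonus column from earlier in the tutorial (the variable name `dropped` is only for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Vishnu', 'Ankit', 'Priya'],
    'marks': [95, 82, 70],
})
df['bonus'] = df['marks'] * 0.1

dropped = df.drop(columns=['bonus'])   # new DataFrame without 'bonus'
print('bonus' in df.columns)           # True: the original still has it
print('bonus' in dropped.columns)      # False

df = df.drop(columns=['bonus'])        # reassign to keep the change
print(list(df.columns))                # ['name', 'marks']
```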
Q: What is SettingWithCopyWarning and how do I fix it?
It appears when you modify a DataFrame that pandas thinks might be a view (a window into another DataFrame) rather than a copy, so it is unclear whether the original will change too. Fix: call `.copy()` when creating a derived DataFrame that you plan to modify, for example `df2 = df[df['marks'] > 80].copy()`. Then modifying `df2` is safe.
Q: CBSE exam: what does df.shape return?
A tuple (number_of_rows, number_of_columns). For example, a DataFrame with 5 students and 3 columns returns (5, 3).
Summary
In this tutorial, you've learned:
- A DataFrame is a 2D labelled table: rows have an index, columns have names, each cell has a value
- Create DataFrames from a dictionary (`pd.DataFrame(data)`) or a list of lists (pass `columns=` too)
- Inspect with `head()`, `tail()`, `info()`, `describe()`, `shape`, and `dtypes`
- Add columns with `df['new_col'] = expression` and drop with `df.drop(columns=[...])`
- Always use `.copy()` on derived DataFrames to avoid `SettingWithCopyWarning`
Interview & Exam Tips
Q: How do you check the data types of all columns in a DataFrame?
`df.dtypes` returns a Series where each index label is a column name and each value is that column's dtype (int64, float64, object, etc.)
Q: What is the difference between df.info() and df.describe()?
df.info() shows column names, dtypes, and non-null counts. df.describe() shows statistical summaries (mean, std, min, max) for numeric columns only.
Q: What does df.shape return?
A tuple (rows, columns), e.g., (5, 3) means 5 rows and 3 columns.
Q: What is the default index of a DataFrame?
RangeIndex starting from 0: 0, 1, 2, ... You can set a custom index with df.set_index('name').
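As a small illustration of the default index versus a custom one, the sketch below calls `set_index('name')` on the report card and then looks up a row by its new label. The variable name `by_name` is chosen only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Vishnu', 'Ankit', 'Priya'],
    'marks': [95, 82, 70],
})
print(list(df.index))                   # [0, 1, 2]

by_name = df.set_index('name')          # returns a new DataFrame
print(list(by_name.index))              # ['Vishnu', 'Ankit', 'Priya']
print(list(by_name.columns))            # ['marks']
print(by_name.loc['Ankit', 'marks'])    # 82
```

Like `drop()`, `set_index()` returns a new DataFrame, so `df` itself keeps its 0, 1, 2 index.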
Further Reading
Continue your learning path:
- Previous: pandas Series (a DataFrame is multiple Series sharing an index)
- Next: Indexing & Selection (learn `.loc[]` and `.iloc[]` to slice DataFrames precisely)
Go deeper:
- Official pandas docs (DataFrame): full method reference
- CSV with pandas: load real data from files instead of typing it manually
- Indexing & Selection: filter rows and select columns the right way