DataFrame Basics 🗃️

Prerequisites: pandas Series, Python Dictionaries

Mentor's Note: A DataFrame is the heart of pandas. Think of it as a Python-powered spreadsheet that can comfortably hold millions of rows, as long as they fit in memory. Once you master DataFrames, you can analyse almost any tabular data. 💡

What You'll Learn

By the end of this tutorial, you'll know:

What a DataFrame is — rows, columns, index, and dtypes explained clearly
How to create a DataFrame from a dictionary or a list of lists (loading from a CSV file is covered in the CSV with pandas tutorial)
How to inspect a DataFrame with head(), info(), describe(), and shape
How to add and drop columns — and the SettingWithCopyWarning trap to avoid

🌟 The Scenario: The Student Report Card

Imagine your school's term report card:

Rows = each student (Vishnu, Ankit, Priya...)
Columns = each subject (Maths, English, Science...)
Each cell = one mark

That entire report card — all students, all subjects, all marks — is one pandas DataFrame. You can sort it, filter it, summarise it, and export it in seconds.
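To make "sort, filter, summarise, export" concrete, here is a minimal sketch. The report card data and the column names (maths, english) are made up for illustration, and each operation is covered properly in later tutorials.

```python
import pandas as pd

# Hypothetical report card: one row per student, one column per subject
report = pd.DataFrame({
    'name':    ['Vishnu', 'Ankit', 'Priya'],
    'maths':   [95, 82, 70],
    'english': [88, 91, 76],
})

ranked = report.sort_values('maths', ascending=False)  # sort by Maths marks
top = report[report['english'] > 80]                   # keep rows with English above 80
maths_mean = report['maths'].mean()                    # summarise one column
csv_text = report.to_csv(index=False)                  # export as CSV text (no file written)

print(ranked)
print(top)
print(maths_mean)
print(csv_text)
```

Calling to_csv() without a file path returns the CSV as a string, which keeps the sketch free of side effects.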

📖 Concept Explanation

1. What is a DataFrame?

A DataFrame is a two-dimensional, labelled data structure — like a table with:

Index — row labels (default: 0, 1, 2... or custom)
Columns — column labels (strings, like 'name', 'marks')
Values — the actual data in each cell

Each column in a DataFrame is a Series. So a DataFrame is a collection of Series that share the same index.
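A quick way to check this yourself is to pull one column out and inspect it. This is a small sketch using the same sample data as the rest of the tutorial.

```python
import pandas as pd

df = pd.DataFrame({
    'name':  ['Vishnu', 'Ankit', 'Priya'],
    'marks': [95, 82, 70],
})

marks = df['marks']   # selecting a single column gives a Series
names = df['name']

print(isinstance(marks, pd.Series))     # True
print(marks.index.equals(names.index))  # True: both columns share the DataFrame's index
print(marks.index.equals(df.index))     # True
```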

2. Data Types (dtypes)

Each column can have a different dtype:

dtype	Python equivalent	Example values
`int64`	`int`	95, 82, 0
`float64`	`float`	3.14, 82.5
`object`	`str`	'Vishnu', 'Surat'
`bool`	`bool`	True, False
`datetime64`	`datetime`	2024-01-15
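The sketch below builds one column per row of the table and prints the dtypes pandas infers. The column names are made up for illustration. The exact label printed for the text column and the datetime column can differ between pandas versions, so no output is shown for them here.

```python
import pandas as pd

df = pd.DataFrame({
    'marks':   [95, 82, 0],                # whole numbers
    'average': [3.14, 82.5, 70.0],         # decimals
    'city':    ['Surat', 'Pune', 'Delhi'], # text
    'passed':  [True, False, True],        # booleans
    'exam_on': pd.to_datetime(['2024-01-15', '2024-01-16', '2024-01-17']),  # dates
})

print(df.dtypes)

# Helper functions in pd.api.types check a dtype's family without
# depending on its exact printed label
print(pd.api.types.is_integer_dtype(df['marks']))          # True
print(pd.api.types.is_float_dtype(df['average']))          # True
print(pd.api.types.is_bool_dtype(df['passed']))            # True
print(pd.api.types.is_datetime64_any_dtype(df['exam_on'])) # True
```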

3. Key Attributes

Attribute	What it returns
`df.shape`	Tuple `(rows, columns)`
`df.columns`	Column names
`df.index`	Row labels
`df.dtypes`	dtype of each column
`df.size`	Total number of cells
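Here is a short sketch of these attributes on a 3-row, 2-column DataFrame. Note that they are attributes, not methods, so they take no parentheses.

```python
import pandas as pd

df = pd.DataFrame({
    'name':  ['Vishnu', 'Ankit', 'Priya'],
    'marks': [95, 82, 70],
})

rows, cols = df.shape     # shape is a tuple, so it can be unpacked

print(df.shape)           # (3, 2)
print(df.size)            # 6, which is rows * cols
print(list(df.columns))   # ['name', 'marks']
print(list(df.index))     # [0, 1, 2]
```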

🎨 Visual Logic

graph TD
    A["Python Dictionary / List of Lists"] --> B["pd.DataFrame()"]
    B --> C["DataFrame\n(rows × columns table)"]
    C --> D["df.index\n(row labels)"]
    C --> E["df.columns\n(column names)"]
    C --> F["df.dtypes\n(type per column)"]

💻 Implementation

The examples below appear in this order: 1. From a Dictionary, 2. From a List of Lists, 3. Inspection Methods, 4. Add & Drop Columns, 5. Interactive REPL.

1. From a Dictionary

import pandas as pd

# Each key = column name, value = list of column values
data = {
    'name':   ['Vishnu', 'Ankit', 'Priya'],
    'marks':  [95, 82, 70],
    'grade':  ['A', 'B', 'C']
}

df = pd.DataFrame(data)
print(df)
#      name  marks grade
# 0  Vishnu     95     A
# 1   Ankit     82     B
# 2   Priya     70     C

print(df.shape)    # (3, 3)
print(df.dtypes)
# name     object
# marks     int64
# grade    object

2. From a List of Lists

import pandas as pd

rows = [
    ['Vishnu', 95, 'A'],
    ['Ankit',  82, 'B'],
    ['Priya',  70, 'C']
]

df = pd.DataFrame(rows, columns=['name', 'marks', 'grade'])
print(df)
#      name  marks grade
# 0  Vishnu     95     A
# 1   Ankit     82     B
# 2   Priya     70     C

3. Inspection Methods

import pandas as pd

data = {
    'name':   ['Vishnu', 'Ankit', 'Priya', 'Sara', 'Raj'],
    'marks':  [95, 82, 70, 88, 60],
    'grade':  ['A', 'B', 'C', 'A', 'D'],
    'passed': [True, True, True, True, False]
}
df = pd.DataFrame(data)

# In a script, wrap expressions in print(); only a REPL or notebook echoes them
print(df.head(3))     # First 3 rows
print(df.tail(2))     # Last 2 rows
df.info()             # Prints column names, dtypes, non-null counts itself
print(df.describe())  # Statistical summary of numeric columns
print(df.shape)       # (5, 4)
print(df.columns)     # Index(['name', 'marks', 'grade', 'passed'])
print(df.index)       # RangeIndex(start=0, stop=5, step=1)
print(df.dtypes)      # dtype for each column

4. Add & Drop Columns

import pandas as pd

df = pd.DataFrame({
    'name':  ['Vishnu', 'Ankit', 'Priya'],
    'marks': [95, 82, 70]
})

# Add a new column — vectorised calculation
df['bonus'] = df['marks'] * 0.1
print(df)
# Output:
#      name  marks  bonus
# 0  Vishnu     95    9.5
# 1   Ankit     82    8.2
# 2   Priya     70    7.0

# Drop a column (returns new DataFrame; original unchanged)
df_clean = df.drop(columns=['bonus'])
print(df_clean)
# Output:
#      name  marks
# 0  Vishnu     95
# 1   Ankit     82
# 2   Priya     70

5. Interactive REPL

Open a terminal, type python3, and explore DataFrames line by line.

>>> import pandas as pd
>>> df = pd.DataFrame({'name': ['Vishnu', 'Ankit', 'Priya'], 'marks': [95, 82, 70]})
>>> df
     name  marks
0  Vishnu     95
1   Ankit     82
2   Priya     70
>>> df.shape
(3, 2)
>>> df.dtypes
name     object
marks     int64
dtype: object
>>> df['marks'].mean()
82.33333333333333
>>> df[df['marks'] > 80]
     name  marks
0  Vishnu     95
1   Ankit     82
>>> df['bonus'] = df['marks'] * 0.1
>>> df
     name  marks  bonus
0  Vishnu     95    9.5
1   Ankit     82    8.2
2   Priya     70    7.0

New to the REPL?

Type python3 in your terminal. Each line starting with >>> is what you type; the lines below it are Python's response. Type exit() to quit.

📊 Sample Dry Run: `df.info()` explained

DataFrame with 4 columns: name (object), marks (int64), grade (object), passed (bool)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4       ← 5 rows, index from 0 to 4
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   name     5 non-null      object    ← All 5 rows have a name (no missing)
 1   marks    5 non-null      int64     ← All 5 rows have marks
 2   grade    5 non-null      object
 3   passed   5 non-null      bool
dtypes: bool(1), int64(1), object(2)
memory usage: 293.0+ bytes

Line	What it tells you
`RangeIndex: 5 entries`	There are 5 rows
`Non-Null Count`	How many values are not missing (compare to total rows to spot gaps)
`Dtype`	The data type of each column
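`memory usage`	Roughly how much RAM the DataFrame uses (a trailing `+` means the real figure is higher, because object columns are not measured in full)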
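The sample above has no gaps, so here is a small sketch of how a missing value shows up. The missing mark for Ankit is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'name':  ['Vishnu', 'Ankit', 'Priya'],
    'marks': [95, None, 70],   # hypothetical gap: Ankit's mark is missing
})

df.info()                      # the marks line reports 2 non-null out of 3 entries

missing = df.isna().sum()      # missing values per column
print(int(missing['name']))    # 0
print(int(missing['marks']))   # 1
print(df['marks'].count())     # 2, the same number info() shows as Non-Null Count
```

isna().sum() is a handy follow-up to info() because it gives the missing count directly instead of making you subtract from the row total.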

🎯 Practice Lab 🧪

Task: Employee DataFrame

Create a DataFrame of 4 employees with these columns: name, department, salary.

Then:

Print its shape and dtypes.
Add a bonus column equal to 10% of salary.
Use df.describe() and note what it tells you about the salary column.
Drop the bonus column and print the result.

Hint: Use a dictionary to build the DataFrame. Salary can be integers (e.g., 50000).

📚 Best Practices & Common Mistakes

✅ Best Practices

Always call df.info() after loading data — it shows you missing values, wrong dtypes, and unexpected column names before you do any analysis
Use df.copy() when deriving a new DataFrame — df2 = df[df['marks'] > 80].copy() prevents the SettingWithCopyWarning when you later modify df2
Name your columns clearly — salary_inr is better than s; you'll thank yourself when reading the code three days later

❌ Common Mistakes

Modifying a slice without .copy(): df2 = df[df['marks'] > 80] followed by df2['bonus'] = 5 can trigger a SettingWithCopyWarning (a warning, not an exception). Use .copy() on derived DataFrames
Confusing df.shape and df.size — shape gives (rows, cols) as a tuple; size gives total cell count (rows × cols). CBSE questions test this distinction
Expecting df.describe() to cover string columns: by default, describe() summarises only the numeric columns of a mixed DataFrame, and string columns are silently skipped. Pass include='all' to summarise every column
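The describe() behaviour is easy to see side by side. This is a minimal sketch; the grade values are made up so that one grade repeats.

```python
import pandas as pd

df = pd.DataFrame({
    'name':  ['Vishnu', 'Ankit', 'Priya'],
    'marks': [95, 82, 70],
    'grade': ['A', 'B', 'A'],
})

numeric_summary = df.describe()            # default: numeric columns only
full_summary = df.describe(include='all')  # every column, text included

print(list(numeric_summary.columns))       # ['marks']
print(list(full_summary.columns))          # ['name', 'marks', 'grade']
print(full_summary.loc['unique', 'grade']) # 2 distinct grades
print(full_summary.loc['top', 'grade'])    # A, the most frequent grade
```

With include='all', text columns get count, unique, top, and freq, while the numeric statistics show as NaN for them.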

❓ Frequently Asked Questions

Q: What's the difference between df.info() and df.describe()?

df.info() shows the structure: column names, how many non-null values each has, and the dtype. df.describe() shows statistics: count, mean, std, min, the 25th/50th/75th percentiles, and max, but by default only for numeric columns. Use info() first to understand your data, then describe() to understand the numbers.
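As a small sketch of the difference, the snippet below runs both on the five-student data from the Implementation section. describe() returns a DataFrame, so individual statistics can be read out of it.

```python
import pandas as pd

df = pd.DataFrame({
    'name':  ['Vishnu', 'Ankit', 'Priya', 'Sara', 'Raj'],
    'marks': [95, 82, 70, 88, 60],
})

df.info()                  # structure: prints directly and returns None

summary = df.describe()    # statistics: returns a DataFrame
print(list(summary.index))
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
print(summary.loc['mean', 'marks'])   # 79.0
print(summary.loc['50%', 'marks'])    # 82.0, the median
```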

Q: Why does df.drop(columns=['bonus']) not change the original DataFrame?

By default, drop() returns a new DataFrame and leaves the original unchanged. To modify in place, use df.drop(columns=['bonus'], inplace=True) — but most pandas style guides recommend avoiding inplace=True and instead reassigning: df = df.drop(columns=['bonus']).
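Here is a minimal sketch of that behaviour, using the reassignment style rather than inplace=True. The variable names trimmed and still_there are just for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'name':  ['Vishnu', 'Ankit', 'Priya'],
    'marks': [95, 82, 70],
})
df['bonus'] = df['marks'] * 0.1

trimmed = df.drop(columns=['bonus'])     # new DataFrame without the column
still_there = 'bonus' in df.columns      # the original is untouched

print(still_there)                 # True
print('bonus' in trimmed.columns)  # False

df = df.drop(columns=['bonus'])    # reassign to make the change stick
print(list(df.columns))            # ['name', 'marks']
```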

Q: What is SettingWithCopyWarning and how do I fix it?

It appears when you modify a DataFrame that pandas thinks might be a view (a window into another DataFrame) rather than a copy. Fix: always call .copy() when creating a derived DataFrame — df2 = df[df['marks'] > 80].copy(). Then modifying df2 is safe.
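The fix can be sketched as follows. The variable name high is hypothetical; the point is that the filtered result is copied before it is modified.

```python
import pandas as pd

df = pd.DataFrame({
    'name':  ['Vishnu', 'Ankit', 'Priya'],
    'marks': [95, 82, 70],
})

# .copy() makes the filtered result an independent DataFrame
high = df[df['marks'] > 80].copy()
high['bonus'] = 5          # modifies only the copy

print(high)
print(list(df.columns))    # ['name', 'marks'], so the original is unchanged
```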

Q: CBSE exam — what does df.shape return?

A tuple (number_of_rows, number_of_columns). For example, a DataFrame with 5 students and 3 columns returns (5, 3).

✅ Summary

In this tutorial, you've learned:

✅ A DataFrame is a 2D labelled table — rows have an index, columns have names, each cell has a value
✅ Create DataFrames from a dictionary (pd.DataFrame(data)) or a list of lists (pass columns= too)
✅ Inspect with head(), tail(), info(), describe(), shape, and dtypes
✅ Add columns with df['new_col'] = expression and drop with df.drop(columns=[...])
✅ Always use .copy() on derived DataFrames to avoid SettingWithCopyWarning

💡 Interview & Exam Tips

Q: How do you check the data types of all columns in a DataFrame?

df.dtypes returns a Series in which each index label is a column name and each value is that column's dtype (int64, float64, object, etc.)

Q: What is the difference between df.info() and df.describe()?

df.info() shows column names, dtypes, and non-null counts. df.describe() shows statistical summaries (mean, std, min, max) for numeric columns only.

Q: What does df.shape return?

A tuple (rows, columns) — e.g., (5, 3) means 5 rows and 3 columns.

Q: What is the default index of a DataFrame?

RangeIndex starting from 0: 0, 1, 2, ... You can set a custom index with df.set_index('name'), which returns a new DataFrame and leaves the original unchanged.
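A short sketch of both index styles follows. The lookup uses .loc[], which belongs to the Indexing & Selection tutorial and appears here only to show what a custom index makes possible.

```python
import pandas as pd

df = pd.DataFrame({
    'name':  ['Vishnu', 'Ankit', 'Priya'],
    'marks': [95, 82, 70],
})

print(list(df.index))       # [0, 1, 2], the default RangeIndex

by_name = df.set_index('name')   # new DataFrame indexed by student name
print(list(by_name.index))       # ['Vishnu', 'Ankit', 'Priya']
print(by_name.loc['Ankit', 'marks'])   # 82, looked up by row label
```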

📚 Further Reading

Continue your learning path:

← pandas Series — a DataFrame is multiple Series sharing an index
Next: Indexing & Selection → — learn .loc[] and .iloc[] to slice DataFrames precisely

Go deeper:

Official pandas docs — DataFrame — full method reference
CSV with pandas — load real data from files instead of typing it manually
Indexing & Selection — filter rows and select columns the right way