DataFrame Basics 🗃️

Prerequisites: pandas Series, Python Dictionaries

Mentor's Note: A DataFrame is the heart of pandas. Think of it as a Python-powered spreadsheet that can hold millions of rows without breaking a sweat, as long as the data fits in memory. Once you master DataFrames, you can analyse almost any tabular data. 💡

What You'll Learn

By the end of this tutorial, you'll know:

  • What a DataFrame is: rows, columns, index, and dtypes explained clearly
  • How to create a DataFrame from a dictionary, a list of lists, and a CSV file
  • How to inspect a DataFrame with head(), info(), describe(), and shape
  • How to add and drop columns, and the SettingWithCopyWarning trap to avoid

🌟 The Scenario: The Student Report Card

Imagine your school's term report card:

  • Rows = each student (Vishnu, Ankit, Priya...)
  • Columns = each subject (Maths, English, Science...)
  • Each cell = one mark

That entire report card (all students, all subjects, all marks) is one pandas DataFrame. You can sort it, filter it, summarise it, and export it in seconds.
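
To make the report card concrete, here is a minimal sketch of it as a DataFrame, with one sort, one filter, one summary, and one export. The marks are invented for illustration, and the variable names (report, by_english, and so on) are assumptions, not part of the lesson.

```python
import pandas as pd

# Invented marks: rows are students, columns are subjects
report = pd.DataFrame(
    {
        'Maths':   [95, 82, 70],
        'English': [88, 91, 65],
        'Science': [90, 78, 84],
    },
    index=['Vishnu', 'Ankit', 'Priya'],
)

# Sort: highest English mark first
by_english = report.sort_values('English', ascending=False)

# Filter: students scoring above 80 in Science
strong_science = report[report['Science'] > 80]

# Summarise: average mark per student (axis=1 averages across the columns)
averages = report.mean(axis=1)

# Export: to_csv() with no path returns the CSV text instead of writing a file
csv_text = report.to_csv()

print(by_english)
print(strong_science)
print(averages)
print(csv_text)
```

Passing a file path to to_csv() would write the same text to disk instead of returning it.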


📖 Concept Explanation

1. What is a DataFrame?

A DataFrame is a two-dimensional, labelled data structure, like a table with:

  • Index: row labels (default: 0, 1, 2... or custom)
  • Columns: column labels (strings, like 'name', 'marks')
  • Values: the actual data in each cell

Each column in a DataFrame is a Series. So a DataFrame is a collection of Series that share the same index.
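
A quick way to check this for yourself: select one column and look at what comes back. This is a small sketch using the same sample data as the rest of the lesson.

```python
import pandas as pd

df = pd.DataFrame({
    'name':  ['Vishnu', 'Ankit', 'Priya'],
    'marks': [95, 82, 70],
})

# Selecting a single column returns a Series
marks = df['marks']
print(isinstance(marks, pd.Series))        # True

# Every column carries the same row labels as the DataFrame itself
print(marks.index.equals(df.index))        # True
print(df['name'].index.equals(df.index))   # True
```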

2. Data Types (dtypes)

Each column can have a different dtype:

dtype        Python equivalent   Example values
int64        int                 95, 82, 0
float64      float               3.14, 82.5
object       str                 'Vishnu', 'Surat'
bool         bool                True, False
datetime64   datetime            2024-01-15
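
The sketch below builds one column for each non-text dtype in the table and asks pandas which kind each one is. The column names (score, exam_date) and most of the values are made up for illustration. Checking with the helpers in pd.api.types avoids depending on the exact dtype label, which can differ slightly between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    'marks':     [95, 82, 0],                      # whole numbers
    'score':     [3.14, 82.5, 70.0],               # decimals
    'passed':    [True, False, True],              # booleans
    'exam_date': pd.to_datetime(
        ['2024-01-15', '2024-02-01', '2024-03-10']  # parsed into datetimes
    ),
})

# One dtype per column
print(df.dtypes)

# Ask pandas what kind of dtype each column holds
is_int = pd.api.types.is_integer_dtype(df['marks'])
is_float = pd.api.types.is_float_dtype(df['score'])
is_bool = pd.api.types.is_bool_dtype(df['passed'])
is_datetime = pd.api.types.is_datetime64_any_dtype(df['exam_date'])
print(is_int, is_float, is_bool, is_datetime)
```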

3. Key Attributes

Attribute    What it returns
df.shape     Tuple (rows, columns)
df.columns   Column names
df.index     Row labels
df.dtypes    dtype of each column
df.size      Total number of cells
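
Here is a short sketch of these attributes on a three-row, two-column DataFrame. Note that they are attributes, not methods, so there are no parentheses after shape or size.

```python
import pandas as pd

df = pd.DataFrame({
    'name':  ['Vishnu', 'Ankit', 'Priya'],
    'marks': [95, 82, 70],
})

rows, cols = df.shape            # unpack the tuple
print(df.shape)                  # (3, 2)
print(df.size)                   # 6
print(list(df.columns))          # ['name', 'marks']
print(list(df.index))            # [0, 1, 2]

# size is always rows multiplied by columns
print(rows * cols == df.size)    # True
```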

🎨 Visual Logic

graph TD
    A["Python Dictionary / List of Lists"] --> B["pd.DataFrame()"]
    B --> C["DataFrame\n(rows × columns table)"]
    C --> D["df.index\n(row labels)"]
    C --> E["df.columns\n(column names)"]
    C --> F["df.dtypes\n(type per column)"]

💻 Implementation

Creating a DataFrame from a dictionary:

import pandas as pd

# Each key = column name, value = list of column values
data = {
    'name':   ['Vishnu', 'Ankit', 'Priya'],
    'marks':  [95, 82, 70],
    'grade':  ['A', 'B', 'C']
}

df = pd.DataFrame(data)
print(df)
#      name  marks grade
# 0  Vishnu     95     A
# 1   Ankit     82     B
# 2   Priya     70     C

print(df.shape)    # (3, 3)
print(df.dtypes)
# name     object
# marks     int64
# grade    object
# dtype: object

Creating a DataFrame from a list of lists:

import pandas as pd

rows = [
    ['Vishnu', 95, 'A'],
    ['Ankit',  82, 'B'],
    ['Priya',  70, 'C']
]

df = pd.DataFrame(rows, columns=['name', 'marks', 'grade'])
print(df)
#      name  marks grade
# 0  Vishnu     95     A
# 1   Ankit     82     B
# 2   Priya     70     C
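
The third route, a CSV file, usually goes through pd.read_csv(). The sketch below is self-contained, so it reads from an in-memory string wrapped in io.StringIO rather than a real file. With a file on disk you would pass its path instead; the name students.csv in the comment is hypothetical.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk. With a real file you would write
# pd.read_csv('students.csv'), where the file name is hypothetical.
csv_text = """name,marks,grade
Vishnu,95,A
Ankit,82,B
Priya,70,C
"""

# The first line of the CSV becomes the column names
df = pd.read_csv(io.StringIO(csv_text))
print(df)
print(df.shape)    # (3, 3)
```

read_csv() infers a dtype for each column, so it is worth calling df.info() straight after loading to confirm the types are what you expect.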

Inspecting a DataFrame:

import pandas as pd

data = {
    'name':   ['Vishnu', 'Ankit', 'Priya', 'Sara', 'Raj'],
    'marks':  [95, 82, 70, 88, 60],
    'grade':  ['A', 'B', 'C', 'A', 'D'],
    'passed': [True, True, True, True, False]
}
df = pd.DataFrame(data)

df.head(3)       # First 3 rows
df.tail(2)       # Last 2 rows
df.info()        # Column names, dtypes, non-null counts
df.describe()    # Statistical summary of numeric columns
df.shape         # (5, 4)
df.columns       # Index(['name', 'marks', 'grade', 'passed'])
df.index         # RangeIndex(start=0, stop=5, step=1)
df.dtypes        # dtype for each column
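
In a notebook or the REPL, the calls above display their result automatically. In a script, a bare expression shows nothing, so wrap it in print(). The sketch below does that and pulls individual statistics out of describe(); the variable name stats is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    'name':  ['Vishnu', 'Ankit', 'Priya', 'Sara', 'Raj'],
    'marks': [95, 82, 70, 88, 60],
})

print(df.head(3))    # first 3 rows
print(df.tail(2))    # last 2 rows

# describe() on one column returns a Series of statistics
stats = df['marks'].describe()
print(stats['count'])   # 5.0
print(stats['mean'])    # 79.0
print(stats['min'])     # 60.0
print(stats['50%'])     # 82.0 (the median)
print(stats['max'])     # 95.0
```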

Adding and dropping columns:

import pandas as pd

df = pd.DataFrame({
    'name':  ['Vishnu', 'Ankit', 'Priya'],
    'marks': [95, 82, 70]
})

# Add a new column with a vectorised calculation
df['bonus'] = df['marks'] * 0.1
print(df)
# Output:
#      name  marks  bonus
# 0  Vishnu     95    9.5
# 1   Ankit     82    8.2
# 2   Priya     70    7.0

# Drop a column (returns new DataFrame; original unchanged)
df_clean = df.drop(columns=['bonus'])
print(df_clean)
# Output:
#      name  marks
# 0  Vishnu     95
# 1   Ankit     82
# 2   Priya     70

Open a terminal, type python3, and explore DataFrames line by line.

>>> import pandas as pd
>>> df = pd.DataFrame({'name': ['Vishnu', 'Ankit', 'Priya'], 'marks': [95, 82, 70]})
>>> df
     name  marks
0  Vishnu     95
1   Ankit     82
2   Priya     70
>>> df.shape
(3, 2)
>>> df.dtypes
name     object
marks     int64
dtype: object
>>> df['marks'].mean()
82.33333333333333
>>> df[df['marks'] > 80]
     name  marks
0  Vishnu     95
1   Ankit     82
>>> df['bonus'] = df['marks'] * 0.1
>>> df
     name  marks  bonus
0  Vishnu     95    9.5
1   Ankit     82    8.2
2   Priya     70    7.0

New to the REPL?

Type python3 in your terminal. Each >>> is what you type; the line below is Python's response. Type exit() to quit.


📊 Sample Dry Run: df.info() explained

DataFrame with 4 columns: name (object), marks (int64), grade (object), passed (bool)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4       ← 5 rows, index from 0 to 4
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   name     5 non-null      object    ← All 5 rows have a name (no missing)
 1   marks    5 non-null      int64     ← All 5 rows have marks
 2   grade    5 non-null      object
 3   passed   5 non-null      bool
dtypes: bool(1), int64(1), object(2)
memory usage: 293.0+ bytes

Line                    What it tells you
RangeIndex: 5 entries   There are 5 rows
Non-Null Count          How many values are not missing (compare to total rows to spot gaps)
Dtype                   The data type of each column
memory usage            Roughly how much RAM the DataFrame uses (the + marks a lower bound when object columns are present)

🎯 Practice Lab 🧪

Task: Employee DataFrame

Create a DataFrame of 4 employees with these columns: name, department, salary.

Then:

  1. Print its shape and dtypes.
  2. Add a bonus column equal to 10% of salary.
  3. Use df.describe() and note what it tells you about the salary column.
  4. Drop the bonus column and print the result.

Hint: Use a dictionary to build the DataFrame. Salary can be integers (e.g., 50000).


📚 Best Practices & Common Mistakes

✅ Best Practices

  • Always call df.info() after loading data. It shows you missing values, wrong dtypes, and unexpected column names before you do any analysis
  • Use df.copy() when deriving a new DataFrame: df2 = df[df['marks'] > 80].copy() prevents the SettingWithCopyWarning when you later modify df2
  • Name your columns clearly: salary_inr is better than s; you'll thank yourself when reading the code three days later
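
The .copy() habit can be sketched as follows. The name top_scorers is an assumption; the point is that the filtered result becomes an independent DataFrame, so adding a column to it leaves the original alone.

```python
import pandas as pd

df = pd.DataFrame({
    'name':  ['Vishnu', 'Ankit', 'Priya'],
    'marks': [95, 82, 70],
})

# .copy() makes top_scorers an independent DataFrame
top_scorers = df[df['marks'] > 80].copy()
top_scorers['bonus'] = 5

print(top_scorers)
print(list(df.columns))    # ['name', 'marks'], the original is untouched
```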

❌ Common Mistakes

  • Modifying a slice without .copy(): df2 = df[df['marks'] > 80] followed by df2['bonus'] = 5 can trigger SettingWithCopyWarning (a warning, not an exception, so the code keeps running). Always use .copy() on derived DataFrames
  • Confusing df.shape and df.size: shape gives (rows, cols) as a tuple; size gives the total cell count (rows × cols). CBSE questions test this distinction
  • Expecting df.describe() to cover string columns: on a DataFrame with mixed columns, describe() summarises only the numeric ones by default and silently skips the rest. Pass include='all' to summarise every column
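
The describe() point is easy to see side by side. This sketch compares the default call with the include='all' option of DataFrame.describe(); the variable names are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    'name':  ['Vishnu', 'Ankit', 'Priya'],
    'marks': [95, 82, 70],
})

numeric_only = df.describe()               # default: numeric columns only
everything = df.describe(include='all')    # every column, text included

print(list(numeric_only.columns))          # ['marks']
print(list(everything.columns))            # ['name', 'marks']

# Text columns get rows such as 'unique' in place of mean and std
print(everything.loc['unique', 'name'])    # 3
```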

❓ Frequently Asked Questions

Q: What's the difference between df.info() and df.describe()?

df.info() shows the structure: column names, how many non-null values each has, and the dtype. df.describe() shows statistics: count, mean, std, min, the 25th/50th/75th percentiles, and max, but only for numeric columns by default. Use info() first to understand your data, then describe() to understand the numbers.

Q: Why does df.drop(columns=['bonus']) not change the original DataFrame?

By default, drop() returns a new DataFrame and leaves the original unchanged. To modify in place, use df.drop(columns=['bonus'], inplace=True), but most pandas style guides recommend avoiding inplace=True and instead reassigning: df = df.drop(columns=['bonus']).
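
A small sketch of the difference, assuming the same sample data used earlier: calling drop() without keeping its result changes nothing, while reassigning does. The variable names still_there and gone are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    'name':  ['Vishnu', 'Ankit', 'Priya'],
    'marks': [95, 82, 70],
})
df['bonus'] = df['marks'] * 0.1

# The returned DataFrame is thrown away, so df keeps its 'bonus' column
df.drop(columns=['bonus'])
still_there = 'bonus' in df.columns
print(still_there)    # True

# Reassigning keeps the result
df = df.drop(columns=['bonus'])
gone = 'bonus' not in df.columns
print(gone)           # True
```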

Q: What is SettingWithCopyWarning and how do I fix it?

It appears when you modify a DataFrame that pandas thinks might be a view (a window into another DataFrame) rather than a copy, so it cannot tell whether you meant to change the original as well. Fix: call .copy() when creating a derived DataFrame that you plan to modify, for example df2 = df[df['marks'] > 80].copy(). Then modifying df2 is safe.

Q: CBSE exam: what does df.shape return?

A tuple (number_of_rows, number_of_columns). For example, a DataFrame with 5 students and 3 columns returns (5, 3).


✅ Summary

In this tutorial, you've learned:

  • ✅ A DataFrame is a 2D labelled table: rows have an index, columns have names, each cell has a value
  • ✅ Create DataFrames from a dictionary (pd.DataFrame(data)) or a list of lists (pass columns= too)
  • ✅ Inspect with head(), tail(), info(), describe(), shape, and dtypes
  • ✅ Add columns with df['new_col'] = expression and drop with df.drop(columns=[...])
  • ✅ Always use .copy() on derived DataFrames to avoid SettingWithCopyWarning

💡 Interview & Exam Tips

Q: How do you check the data types of all columns in a DataFrame?

df.dtypes returns a Series in which each index label is a column name and each value is that column's dtype (int64, float64, object, etc.)

Q: What is the difference between df.info() and df.describe()?

df.info() shows column names, dtypes, and non-null counts. df.describe() shows statistical summaries (mean, std, min, max) for numeric columns only.

Q: What does df.shape return?

A tuple (rows, columns). For example, (5, 3) means 5 rows and 3 columns.

Q: What is the default index of a DataFrame?

RangeIndex starting from 0: 0, 1, 2, ... You can set a custom index with df.set_index('name').
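
As a brief sketch of both halves of that answer: the default index is a RangeIndex, and set_index('name') returns a new DataFrame whose rows are labelled by name. The variable by_name is an assumption, and .loc is used only to look a row up by its label.

```python
import pandas as pd

df = pd.DataFrame({
    'name':  ['Vishnu', 'Ankit', 'Priya'],
    'marks': [95, 82, 70],
})
print(df.index)    # RangeIndex(start=0, stop=3, step=1)

# set_index() returns a new DataFrame; df itself is unchanged
by_name = df.set_index('name')
print(by_name.index.tolist())            # ['Vishnu', 'Ankit', 'Priya']

# Rows can now be looked up by label
print(by_name.loc['Priya', 'marks'])     # 70
```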

