
DataFrame Basics 🗃️

Mentor's Note: A DataFrame is the heart of pandas. Think of it as a Python-powered spreadsheet that can handle millions of rows, as long as they fit in memory. Once you are comfortable with DataFrames, most everyday data analysis becomes approachable. 💡

What You'll Learn

By the end of this tutorial, you'll know:

  • What a DataFrame is: rows, columns, index, and dtypes explained clearly
  • How to create a DataFrame from a dictionary, a list of lists, and a CSV file
  • How to inspect a DataFrame with head(), info(), describe(), and shape
  • How to add and drop columns, and the SettingWithCopyWarning trap to avoid

🌟 The Scenario: The Student Report Card

Imagine your school's term report card:

  • Rows = each student (Vishnu, Ankit, Priya...)
  • Columns = each subject (Maths, English, Science...)
  • Each cell = one mark

That entire report card (all students, all subjects, all marks) is one pandas DataFrame. You can sort it, filter it, summarise it, and export it in seconds.


📖 Concept Explanation

1. What is a DataFrame?

A DataFrame is a two-dimensional, labelled data structure, like a table with:

  • Index: row labels (default: 0, 1, 2... or custom)
  • Columns: column labels (strings, like 'name', 'marks')
  • Values: the actual data in each cell

Each column in a DataFrame is a Series. So a DataFrame is a collection of Series that share the same index.
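
A quick way to see this is to pull one column out and check its type. The sketch below uses a small made-up report card; the names and marks are illustrative only.

```python
import pandas as pd

# Illustrative data, not from a real dataset
df = pd.DataFrame({
    'name': ['Vishnu', 'Ankit', 'Priya'],
    'marks': [95, 82, 70],
})

marks = df['marks']                   # selecting one column gives a Series
print(type(marks).__name__)           # Series
print(marks.index.equals(df.index))   # True: the column shares the DataFrame's index
```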

2. Data Types (dtypes)

Each column can have a different dtype:

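| dtype      | Python equivalent | Example values    |
| ---------- | ----------------- | ----------------- |
| int64      | int               | 95, 82, 0         |
| float64    | float             | 3.14, 82.5        |
| object     | str               | 'Vishnu', 'Surat' |
| bool       | bool              | True, False       |
| datetime64 | datetime          | 2024-01-15        |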
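
To see several of these dtypes side by side, here is a minimal sketch with invented values. The column names percentage, passed, and exam_date are assumptions made for illustration; pd.to_datetime() is used to get a datetime column from date strings.

```python
import pandas as pd

# Made-up values chosen so each column gets a different dtype
df = pd.DataFrame({
    'marks': [95, 82, 70],                  # whole numbers -> int64
    'percentage': [95.0, 82.5, 70.25],      # decimals -> float64
    'passed': [True, True, False],          # booleans -> bool
    'exam_date': pd.to_datetime(['2024-01-15', '2024-01-16', '2024-01-17']),
})
print(df.dtypes)

# astype() returns a converted copy of the column; df itself is unchanged
marks_as_float = df['marks'].astype('float64')
print(marks_as_float.dtype)   # float64
```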

3. Key Attributes

| Attribute  | What it returns         |
| ---------- | ----------------------- |
| df.shape   | Tuple (rows, columns)   |
| df.columns | Column names            |
| df.index   | Row labels              |
| df.dtypes  | dtype of each column    |
| df.size    | Total number of cells   |
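
As a rough illustration, these attributes can be read straight off a small DataFrame. The data here is the same three-student example used elsewhere in this tutorial.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Vishnu', 'Ankit', 'Priya'],
    'marks': [95, 82, 70],
    'grade': ['A', 'B', 'C'],
})

print(df.shape)          # (3, 3)
print(df.size)           # 9  (3 rows x 3 columns)
print(list(df.columns))  # ['name', 'marks', 'grade']
print(list(df.index))    # [0, 1, 2]
```

Note that these are attributes, not methods, so they are written without parentheses.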


💻 Implementation

import pandas as pd

# Each key = column name, value = list of column values
data = {
    'name': ['Vishnu', 'Ankit', 'Priya'],
    'marks': [95, 82, 70],
    'grade': ['A', 'B', 'C']
}

df = pd.DataFrame(data)
print(df)
#      name  marks grade
# 0  Vishnu     95     A
# 1   Ankit     82     B
# 2   Priya     70     C

print(df.shape)  # (3, 3)
print(df.dtypes)
# name     object
# marks     int64
# grade    object
📊 Sample Dry Run: df.info() explained

A DataFrame with 5 rows and 4 columns: name (object), marks (int64), grade (object), passed (bool)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4          ← 5 rows, index from 0 to 4
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   name    5 non-null      object    ← All 5 rows have a name (no missing)
 1   marks   5 non-null      int64     ← All 5 rows have marks
 2   grade   5 non-null      object
 3   passed  5 non-null      bool
dtypes: bool(1), int64(1), object(2)
memory usage: 293.0+ bytes

| Line                  | What it tells you                                                          |
| --------------------- | -------------------------------------------------------------------------- |
| RangeIndex: 5 entries | There are 5 rows                                                           |
| Non-Null Count        | How many values are not missing (compare to total rows to spot gaps)       |
| Dtype                 | The data type of each column                                               |
| memory usage          | Roughly how much RAM the DataFrame uses (the + means object columns may take more) |
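
To reproduce a dry run like the one above and try the other inspection tools, here is a minimal sketch. The last two students and the pass mark of 40 are assumptions added to fill out a 5-row example.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Vishnu', 'Ankit', 'Priya', 'Riya', 'Kabir'],  # last two are made up
    'marks': [95, 82, 70, 38, 64],
    'grade': ['A', 'B', 'C', 'F', 'C'],
})
df['passed'] = df['marks'] >= 40   # pass mark of 40 is an assumption

df.info()                      # structure: columns, non-null counts, dtypes
print(df.head(2))              # first 2 rows
print(df['marks'].describe())  # count, mean, std, min, quartiles, max
```

info() prints directly and returns None, so there is no need to wrap it in print().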

🎯 Practice Lab 🧪

Task: Employee DataFrame

Create a DataFrame of 4 employees with these columns: name, department, salary.

Then:

  1. Print its shape and dtypes.
  2. Add a bonus column equal to 10% of salary.
  3. Use df.describe() and note what it tells you about the salary column.
  4. Drop the bonus column and print the result.

Hint: Use a dictionary to build the DataFrame. Salary can be integers (e.g., 50000).
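
If you get stuck, here is one possible way through the task. The employee names, departments, and salaries are invented for illustration; try it yourself first and compare afterwards.

```python
import pandas as pd

# Hypothetical employees, made up for this exercise
employees = pd.DataFrame({
    'name': ['Asha', 'Rohan', 'Meera', 'Dev'],
    'department': ['HR', 'IT', 'IT', 'Sales'],
    'salary': [50000, 65000, 72000, 48000],
})

# 1. Shape and dtypes
print(employees.shape)    # (4, 3)
print(employees.dtypes)

# 2. Add a bonus column equal to 10% of salary
employees['bonus'] = employees['salary'] * 0.10

# 3. Summary statistics for salary
print(employees['salary'].describe())

# 4. Drop the bonus column; drop() returns a new DataFrame, so reassign
employees = employees.drop(columns=['bonus'])
print(employees)
```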


📚 Best Practices & Common Mistakes

✅ Best Practices

  • Always call df.info() after loading data: it shows you missing values, wrong dtypes, and unexpected column names before you do any analysis
  • Use df.copy() when deriving a new DataFrame: df2 = df[df['marks'] > 80].copy() prevents the SettingWithCopyWarning when you later modify df2
  • Name your columns clearly: salary_inr is better than s; you'll thank yourself when reading the code three days later

โŒ Common Mistakesโ€‹

  • Modifying a slice without .copy(): df2 = df[df['marks'] > 80] followed by df2['bonus'] = 5 can trigger SettingWithCopyWarning. Call .copy() on derived DataFrames you plan to modify
  • Confusing df.shape and df.size: shape gives (rows, cols) as a tuple; size gives the total cell count (rows × cols). CBSE questions test this distinction
  • Expecting df.describe() to cover string columns: by default, describe() summarises only numeric columns, and string columns are silently skipped. Pass include='all' or include='object' to summarise them too
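
The describe() behaviour in the last point is easy to check for yourself. A small sketch, using the three-student example data:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Vishnu', 'Ankit', 'Priya'],
    'marks': [95, 82, 70],
    'grade': ['A', 'B', 'C'],
})

numeric_only = df.describe()              # default: numeric columns only
everything = df.describe(include='all')   # text columns get count/unique/top/freq

print(list(numeric_only.columns))  # ['marks']
print(list(everything.columns))    # ['name', 'marks', 'grade']
```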

โ“ Frequently Asked Questionsโ€‹

Q: What's the difference between df.info() and df.describe()?

df.info() shows the structure: column names, how many non-null values each has, and the dtype. df.describe() shows statistics (count, mean, std, min, 25th/50th/75th percentiles, max), but by default only for numeric columns. Use info() first to understand your data, then describe() to understand the numbers.

Q: Why does df.drop(columns=['bonus']) not change the original DataFrame?

By default, drop() returns a new DataFrame and leaves the original unchanged. To modify in place, use df.drop(columns=['bonus'], inplace=True), but most pandas style guides recommend avoiding inplace=True and instead reassigning: df = df.drop(columns=['bonus']).
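
A short sketch of this behaviour, with invented salary and bonus figures:

```python
import pandas as pd

df = pd.DataFrame({'salary': [50000, 65000], 'bonus': [5000, 6500]})

# drop() hands back a new DataFrame; df still has both columns
result = df.drop(columns=['bonus'])
print(list(df.columns))      # ['salary', 'bonus']
print(list(result.columns))  # ['salary']

# Reassigning is the usual way to make the change stick
df = df.drop(columns=['bonus'])
print(list(df.columns))      # ['salary']
```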

Q: What is SettingWithCopyWarning and how do I fix it?

It appears when you modify a DataFrame that pandas thinks might be a view (a window into another DataFrame) rather than a copy. Fix: call .copy() when creating a derived DataFrame you intend to modify, for example df2 = df[df['marks'] > 80].copy(). Then modifying df2 is safe.
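
Here is a minimal sketch of the safe pattern. The variable name top and the bonus value are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Vishnu', 'Ankit', 'Priya'],
    'marks': [95, 82, 70],
})

top = df[df['marks'] > 80].copy()   # explicit, independent copy of the filtered rows
top['bonus'] = 5                    # modifies only top; df is untouched

print(list(top.columns))  # ['name', 'marks', 'bonus']
print(list(df.columns))   # ['name', 'marks']
```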

Q: For the CBSE exam, what does df.shape return?

A tuple (number_of_rows, number_of_columns). For example, a DataFrame with 5 students and 3 columns returns (5, 3).


✅ Summary

In this tutorial, you've learned:

  • ✅ A DataFrame is a 2D labelled table: rows have an index, columns have names, each cell has a value
  • ✅ Create DataFrames from a dictionary (pd.DataFrame(data)) or a list of lists (pass columns= too)
  • ✅ Inspect with head(), tail(), info(), describe(), shape, and dtypes
  • ✅ Add columns with df['new_col'] = expression and drop with df.drop(columns=[...])
  • ✅ Always use .copy() on derived DataFrames to avoid SettingWithCopyWarning

💡 Interview & Exam Tips

Q: How do you check the data types of all columns in a DataFrame?

df.dtypes returns a Series where each index label is a column name and each value is that column's dtype (int64, float64, object, etc.)

Q: What is the difference between df.info() and df.describe()?

df.info() shows column names, dtypes, and non-null counts. df.describe() shows statistical summaries (mean, std, min, max) for numeric columns only.

Q: What does df.shape return?

A tuple (rows, columns). For example, (5, 3) means 5 rows and 3 columns.

Q: What is the default index of a DataFrame?

RangeIndex starting from 0: 0, 1, 2, ... You can set a custom index with df.set_index('name').
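
As a sketch of the difference, the example below compares the default index with a custom one. The variable name by_name is an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Vishnu', 'Ankit', 'Priya'],
    'marks': [95, 82, 70],
})
print(df.index)   # RangeIndex(start=0, stop=3, step=1)

# set_index() returns a new DataFrame whose row labels are the names
by_name = df.set_index('name')
print(by_name.loc['Priya', 'marks'])   # 70
```

With names as the index, rows can be looked up by label through .loc instead of by position.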

