Deleting Duplicate Rows ๐งน¶
Mentor's Note: Duplicates are the "Dust" of your database. They happen because of bugs, missing constraints, or bad imports. Cleaning them is a vital skill for every Data Engineer. ๐ก
๐ The Scenario: The Double Invite ๐¶
Imagine you are sending wedding invitations. - The Problem: "Rahul Sharma" is accidentally in your list twice. - The Goal: You want to delete the extra copy but keep one original. - The Challenge: Since they are exactly the same, how do you tell the robot which one to throw away?
๐ป Method 1: Using ROWID (Oracle) ๐¶
Oracle gives every row a unique physical address called a ROWID.
-- Scenario: Keep the first copy, delete others
DELETE FROM students
WHERE rowid NOT IN (
SELECT MIN(rowid)
FROM students
GROUP BY name, email -- Define what makes them a 'duplicate'
);
๐ป Method 2: Using CTE & ROW_NUMBER() (SQL Server / Postgre) ๐¶
This is the most modern and readable way.
WITH CTE AS (
SELECT
id,
ROW_NUMBER() OVER (
PARTITION BY name, email
ORDER BY id
) AS occurrence
FROM students
)
DELETE FROM CTE WHERE occurrence > 1;
๐ป Method 3: Using a Temporary Table (Universal) ๐¶
This works in every database and is very safe.
- Select distinct records into a new table.
- Drop the old table.
- Rename the clean table.
๐จ Visual Logic: The Sieve Method¶
graph TD
A[Raw Data] --> B{Group by Unique Fields}
B --> C[Keep 1st Instance โ
]
B --> D[Send others to DELETE ๐๏ธ]
๐ก Prevention Tip¶
"The best way to handle duplicates is to prevent them! Always use UNIQUE CONSTRAINTS on your email or phone number columns so the database rejects the duplicate before it even gets in."