Deleting Duplicate Rows 🧹¶

Mentor's Note: Duplicates are the "Dust" of your database. They happen because of bugs, missing constraints, or bad imports. Cleaning them is a vital skill for every Data Engineer. 💡

🌟 The Scenario: The Double Invite 💌¶

Imagine you are sending wedding invitations. - The Problem: "Rahul Sharma" is accidentally in your list twice. - The Goal: You want to delete the extra copy but keep one original. - The Challenge: Since they are exactly the same, how do you tell the robot which one to throw away?

💻 Method 1: Using ROWID (Oracle) 🆔¶

Oracle gives every row a unique physical address called a ROWID.

-- Scenario: Keep the first copy, delete others
DELETE FROM students
WHERE rowid NOT IN (
    SELECT MIN(rowid)
    FROM students
    GROUP BY name, email -- Define what makes them a 'duplicate'
);

💻 Method 2: Using CTE & ROW_NUMBER() (SQL Server / Postgre) 📊¶

This is the most modern and readable way.

WITH CTE AS (
    SELECT 
        id, 
        ROW_NUMBER() OVER (
            PARTITION BY name, email 
            ORDER BY id
        ) AS occurrence
    FROM students
)
DELETE FROM CTE WHERE occurrence > 1;

💻 Method 3: Using a Temporary Table (Universal) 📑¶

This works in every database and is very safe.

Select distinct records into a new table.

CREATE TABLE students_clean AS 
SELECT DISTINCT * FROM students;

Drop the old table.
```
DROP TABLE students;
```
Rename the clean table.
```
RENAME students_clean TO students;
```

🎨 Visual Logic: The Sieve Method¶

graph TD
    A[Raw Data] --> B{Group by Unique Fields}
    B --> C[Keep 1st Instance ✅]
    B --> D[Send others to DELETE 🗑️]

💡 Prevention Tip¶

"The best way to handle duplicates is to prevent them! Always use UNIQUE CONSTRAINTS on your email or phone number columns so the database rejects the duplicate before it even gets in."