Skip to content

Deleting Duplicate Rows ๐Ÿงน

Mentor's Note: Duplicates are the "Dust" of your database. They happen because of bugs, missing constraints, or bad imports. Cleaning them is a vital skill for every Data Engineer. ๐Ÿ’ก


๐ŸŒŸ The Scenario: The Double Invite ๐Ÿ’Œ

Imagine you are sending wedding invitations. - The Problem: "Rahul Sharma" is accidentally in your list twice. - The Goal: You want to delete the extra copy but keep one original. - The Challenge: Since they are exactly the same, how do you tell the robot which one to throw away?


๐Ÿ’ป Method 1: Using ROWID (Oracle) ๐Ÿ†”

Oracle gives every row a unique physical address called a ROWID.

-- Scenario: Keep the first copy, delete others
DELETE FROM students
WHERE rowid NOT IN (
    SELECT MIN(rowid)
    FROM students
    GROUP BY name, email -- Define what makes them a 'duplicate'
);

๐Ÿ’ป Method 2: Using CTE & ROW_NUMBER() (SQL Server / Postgre) ๐Ÿ“Š

This is the most modern and readable way.

WITH CTE AS (
    SELECT 
        id, 
        ROW_NUMBER() OVER (
            PARTITION BY name, email 
            ORDER BY id
        ) AS occurrence
    FROM students
)
DELETE FROM CTE WHERE occurrence > 1;

๐Ÿ’ป Method 3: Using a Temporary Table (Universal) ๐Ÿ“‘

This works in every database and is very safe.

  1. Select distinct records into a new table.
    CREATE TABLE students_clean AS 
    SELECT DISTINCT * FROM students;
    
  2. Drop the old table.
    DROP TABLE students;
    
  3. Rename the clean table.
    RENAME students_clean TO students;
    

๐ŸŽจ Visual Logic: The Sieve Method

graph TD
    A[Raw Data] --> B{Group by Unique Fields}
    B --> C[Keep 1st Instance โœ…]
    B --> D[Send others to DELETE ๐Ÿ—‘๏ธ]

๐Ÿ’ก Prevention Tip

"The best way to handle duplicates is to prevent them! Always use UNIQUE CONSTRAINTS on your email or phone number columns so the database rejects the duplicate before it even gets in."


๐Ÿ“ˆ Learning Path