Handling duplicate records in a database is a common task for data engineers and analysts. In this blog post, we will walk you through a step-by-step guide to remove duplicate records from a Snowflake table, using an example to demonstrate the process.
There are multiple ways to remove the duplicates but in this post, we will focus on one particularly efficient technique to remove duplicates and provide a detailed explanation to help you understand and apply it effectively in your work.
1.Set up the sample table and data.
To illustrate the process of removing duplicates, we will use a sample customer
table with the following schema:
CREATE TABLE customer (
id INT AUTOINCREMENT PRIMARY KEY,
first_name VARCHAR(100),
last_name VARCHAR(100),
email VARCHAR(100)
);
We will insert sample data into the customer
table, including some duplicate records:
INSERT INTO customer (first_name, last_name, email)
VALUES ('Suresh', 'Reddy', 'suresh.reddy@example.com'),
('Suresh', 'J', 'Suresh.smith@example.com'),
('Shiva', 'Balaji', 'bala.siva@example.com'),
('Suresh', 'Reddy', 'suresh.reddy@example.com'),
('Shiva', 'Balaji', 'bala.siva@example.com');
Step 2: Identify and remove duplicate records using SQL
Notice that there are duplicate records for Suresh Reddy and Shiva Balaji.
SELECT
id,
ROW_NUMBER() OVER (PARTITION BY first_name, last_name, email ORDER BY id) AS row_num
FROM
customer
The above query works by utilizing the ROW_NUMBER() window function to assign a unique row number to each row within a partition. The partition is defined by the combination of first_name
, last_name
, and email
columns
DELETE FROM CUSTOMER WHERE ID IN (
SELECT id FROM (
SELECT
id,
ROW_NUMBER() OVER (PARTITION BY first_name, last_name, email ORDER BY id) AS row_num
FROM
customer
)
WHERE row_num > 1
);
The DELETE statement then removes all rows whose id
values appear in the subquery with a row_num
greater than 1.
Step 3: Verify the results
To ensure the duplicate records have been removed from the customer
table, run a SELECT query:
SELECT * FROM customer;
The result should display the remaining unique records:
Removing duplicate records from a Snowflake table is essential for maintaining clean and accurate data. By using the ROW_NUMBER() window function and a DELETE statement, you can effectively eliminate duplicate records from your Snowflake tables. With this step-by-step guide and example, you should be well-equipped to handle duplicate records in your Snowflake database.
Please share this guide with others who may find it helpful, and don’t hesitate to comment or ask any questions.