How to write a common procedure to find and remove duplicates from any table and columns in Oracle?

Problem Statement:
You want to write a common procedure to find and remove duplicates from any table and columns in Oracle.

Solution:
We can use Oracle’s internal ROWID value for uniquely identifying rows in a table along with the OLAP function row_number with partition clause. The sample syntax to acheive this would like below.

delete from table
where rowid in
  (... query here ...)

To demonstrate the usage, we will begin by creating sample data.

Example

-- table with tennis player rankings
DROP TABLE atp_stats;

CREATE TABLE atp_stats
( player_rank NUMBER NOT NULL,
  player_name VARCHAR2(100) NOT NULL,
  time_range  TIMESTAMP(6));

-- sample records
INSERT INTO atp_stats VALUES (1,'ROGER FEDERER',CURRENT_TIMESTAMP);
INSERT INTO atp_stats VALUES (2,'RAFAEL NADAL',CURRENT_TIMESTAMP);
INSERT INTO atp_stats VALUES (3,'NOVAK DJOKOVIC',CURRENT_TIMESTAMP);
INSERT INTO atp_stats VALUES (4,'ANDY MURRAY',CURRENT_TIMESTAMP);
INSERT INTO atp_stats VALUES (1,'ROGER FEDERER',CURRENT_TIMESTAMP);
INSERT INTO atp_stats VALUES (2,'RAFAEL NADAL',CURRENT_TIMESTAMP);
INSERT INTO atp_stats VALUES (3,'NOVAK DJOKOVIC',CURRENT_TIMESTAMP);
COMMIT;

Looking at the data we just created.

Example

SELECT * FROM atp_stats ORDER BY 2;

player_rank	player_name
4	ANDY MURRAY
3	NOVAK DJOKOVIC
3	NOVAK DJOKOVIC
2	RAFAEL NADAL
2	RAFAEL NADAL
1	ROGER FEDERER
1	ROGER FEDERER

So, we have inserted 3 duplciates which we wanted to remove. before we go on and write a Delete statement, let us understand the inner query with ROWID.

Example

SELECT rowid
   FROM (
SELECT player_rank,
  player_rank,
  rowid ,
  row_number() over (partition BY player_rank, player_name order by player_rank,player_name) AS rnk
FROM atp_stats
)
WHERE rnk > 1;

I had intentionally added the columns player_rank and player_name to this innermost subquery to make the logic understandable. Ideally, innermost subquery could be written without them to the same effect. If we execute just this innermost query offcourse with the extra columns selected for clarity, we see these results.

player_rank	player_name	rowid	rnk
4	ANDY MURRAY	AAAPHcAAAAAB/4TAAD	1
3	NOVAK DJOKOVIC	AAAPHcAAAAAB/4TAAC	1
3	NOVAK DJOKOVIC	AAAPHcAAAAAB/4TAAG	2
2	RAFAEL NADAL	AAAPHcAAAAAB/4TAAB	1
2	RAFAEL NADAL	AAAPHcAAAAAB/4TAAF	2
1	ROGER FEDERER	AAAPHcAAAAAB/4TAAE	1
1	ROGER FEDERER	AAAPHcAAAAAB/4TAAA	2

The SQL returns the rowid for all the rows in the table. The ROW_NUMBER() function then works over sets of id and player_name driven by the PARTITION BY instruction. This means that for every unique player_rank and player_name, ROW_NUMBER will start a running count of rows we have aliased as rnk. When a new player_rank and player_name combination is observed, the rnk counter resets to 1.

Now we can apply the DELETE operator to remove the duplicate values as below.

SQL: Remove duplicates

Example

DELETE
  FROM atp_stats
 WHERE rowid IN (
                   SELECT rowid
                     FROM(
                          SELECT player_rank,  player_name,
                            rowid ,
                            row_number() over (partition BY player_rank, player_name order by player_rank,player_name) AS rnk
                          FROM atp_stats
                         )
                     WHERE rnk > 1
                );

Output

3 rows deleted.

player_rank	player_name
4	ANDY MURRAY
3	NOVAK DJOKOVIC
2	RAFAEL NADAL
1	ROGER FEDERER

As removing duplicates is one of the most common task a programmer performs, it would be nice to have a procedure created for reusability. Below procedure is will accept the table name from which the duplicates had to be removed along with the column names for search.

First up we will create a table type for passing dynamic number of columns to be grouped. We will then create a procedure to delete the data dynamically.

Code: Common procedure to remove duplicates

Example

CREATE OR REPLACE TYPE tmp_args AS TABLE OF VARCHAR2(30);

CREATE PROCEDURE remove_duplicates
 (p_table IN VARCHAR2,
  p_cols  tmp_args)
 AS
   l_remve_dupl   CLOB ;
   l_columns      VARCHAR2(30);
   l_sql_count    NUMBER;
 
 BEGIN
   
    -- get the columns and combine them as comma seperated value
    SELECT LISTAGG(COLUMN_VALUE, ',') WITHIN GROUP(ORDER BY COLUMN_VALUE) INTO l_columns FROM TABLE(p_cols);
   
    -- generate dynamic delete statement
    SELECT 'DELETE FROM ' || p_table  ||
           ' WHERE rowid IN (
                             SELECT rowid
                                   FROM(
                                        SELECT rowid ,
                                          row_number() OVER (partition BY ' || l_columns || ' ORDER BY ' || l_columns || ') AS rnk
                                        FROM ' || p_table || '
                                       )
                                   WHERE rnk > 1
                ) ' INTO l_remve_dupl  FROM DUAL ;
   
    EXECUTE IMMEDIATE l_remve_dupl;
    l_sql_count := SQL%ROWCOUNT;
    COMMIT;
 END;

Usage:

BEGIN
   remove_duplicates('ATP_STATS', tmp_args('PLAYER_RANK','PLAYER_NAME'));
END;

Output

player_rank	player_name
4	ANDY MURRAY
3	NOVAK DJOKOVIC
2	RAFAEL NADAL
1	ROGER FEDERER

Kiran P

Updated on: 2020-12-04T10:35:34+05:30

724 Views

Kickstart Your Career

Get certified by completing the course

Get Started