How to remove only the first duplicate row by group in an R data frame?


To remove only the first occurrence of each group value while keeping the later, duplicated rows, we can use the filter() function of the dplyr package together with the duplicated() function.

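For example, if we have a data frame called df that contains a grouping column, say Grp, then the first row of each group can be removed with the command below. Within each group, duplicated(Grp) is TRUE for every row after the first one, and n()==1 keeps groups that have only a single row, since those have no duplicate to fall back on −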

df %>% group_by(Grp) %>% filter(duplicated(Grp) | n() == 1)
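
To see what each part of the condition does, here is a minimal sketch on a small fixed data frame. The name toy and its values are made up for illustration and are not part of the examples in this article. Group "c" has only one row, so it shows the effect of n() == 1.

```r
library(dplyr)

# Hypothetical toy data: group "c" has a single row
toy <- data.frame(Grp = c("a", "a", "b", "a", "b", "c"),
                  Val = 1:6)

# duplicated() is FALSE for the first occurrence of a value
# and TRUE for every later occurrence
duplicated(toy$Grp)

# Drop the first row of each group, but keep single-row groups
result <- toy %>%
  group_by(Grp) %>%
  filter(duplicated(Grp) | n() == 1) %>%
  ungroup()

result
```

In this sketch the first "a" row and the first "b" row are dropped, while the only "c" row is kept. Without the n() == 1 part, the "c" row would be removed as well.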

Example 1

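The following snippet creates a sample data frame. No seed is set, so sample() and rpois() return different values on each run and your data will differ from the output shown −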

Group<-sample(LETTERS[1:4],20,replace=TRUE)
Response<-rpois(20,5)
df1<-data.frame(Group,Response)
df1

Output

The following data frame is created −

 Group Response
1  D   9
2  A   3
3  B   4
4  A   5
5  B   8
6  B   8
7  D   2
8  D   5
9  B   4
10 C   4
11 D   7
12 D   5
13 C   5
14 A   2
15 B   5
16 A   9
17 B   6
18 C   8
19 D   3
20 A   7

To load the dplyr package and remove only the first row of each group in df1, add the following code to the above snippet −

library(dplyr)
df1 %>% group_by(Group) %>% filter(duplicated(Group) | n() == 1)

Output

If you execute all of the above code as a single program, it generates the following output −

# A tibble: 16 x 2
# Groups: Group [4]
 Group Response
 <chr> <int>
1  A    5
2  B    8
3  B    8
4  D    2
5  D    5
6  B    4
7  D    7
8  D    5
9  C    5
10 A    2
11 B    5
12 A    9
13 B    6
14 C    8
15 D    3
16 A    7
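
As a cross-check, the same rows can probably also be selected with dplyr's row_number() instead of duplicated(), because within a group "not the first row" and "a duplicate of the group value" mean the same thing. The sketch below is an alternative, not the method used above. The seed value and the names by_duplicated and by_row_number are assumptions added here so the comparison is repeatable, so the data will not match the output shown in this example.

```r
library(dplyr)

set.seed(123)  # assumed seed, only so this sketch is repeatable
Group <- sample(LETTERS[1:4], 20, replace = TRUE)
Response <- rpois(20, 5)
df1 <- data.frame(Group, Response)

# Approach from this article: keep rows flagged as duplicates
by_duplicated <- df1 %>%
  group_by(Group) %>%
  filter(duplicated(Group) | n() == 1)

# Alternative: keep every row after the first within each group
by_row_number <- df1 %>%
  group_by(Group) %>%
  filter(row_number() > 1 | n() == 1)

# Compare the two results as plain data frames
all.equal(as.data.frame(by_duplicated), as.data.frame(by_row_number))
```

The number of rows removed should equal the number of groups that have more than one row, which is a quick way to confirm the filter did what was intended.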

Example 2

The following snippet creates a sample data frame −

Category<-sample(c("First","Second","Third"),20,replace=TRUE)
Rank<-sample(1:10,20,replace=TRUE)
df2<-data.frame(Category,Rank)
df2

Output

The following data frame is created −

 Category Rank
1  Second  10
2  Second   5
3  Second   4
4  Third    3
5  Second   5
6  Second   9
7  First    6
8  Second  10
9  First    9
10 Third    1
11 First    8
12 Second   3
13 Second   5
14 Third    1
15 Third    2
16 Second   4
17 Second   6
18 Third    6
19 Second   2
20 Second   9

To remove only the first row of each group in df2, add the following code to the above snippet −

df2 %>% group_by(Category) %>% filter(duplicated(Category) | n() == 1)

Output

If you execute all of the above code as a single program, it generates the following output −

# A tibble: 17 x 2
# Groups: Category [3]
  Category Rank
  <chr>    <int>
1  Second   5
2  Second   4
3  Second   5
4  Second   9
5  Second  10
6  First    9
7  Third    1
8  First    8
9  Second   3
10 Second   5
11 Third    1
12 Third    2
13 Second   4
14 Second   6
15 Third    6
16 Second   2
17 Second   9
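
If dplyr is not available, a similar filter can be sketched in base R. This is an alternative to the approach above, not part of it. The data frame df_fixed and the helper names group_size, keep and base_result are hypothetical, and the data is fixed rather than sampled so the sketch is repeatable. The ave() call is used to count how many rows share each row's category.

```r
# Hypothetical fixed data; "Third" appears only once
df_fixed <- data.frame(
  Category = c("Second", "First", "Second", "Third", "First", "Second"),
  Rank     = c(10, 6, 5, 3, 9, 4)
)

# Number of rows in each row's category
group_size <- ave(seq_along(df_fixed$Category), df_fixed$Category, FUN = length)

# Keep later occurrences, plus rows whose category occurs only once
keep <- duplicated(df_fixed$Category) | group_size == 1
base_result <- df_fixed[keep, ]
base_result
```

Here the first "Second" row and the first "First" row are dropped, and the single "Third" row is kept. The result is an ordinary data frame that keeps its original row names, rather than a grouped tibble.
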
raja
Updated on 06-Nov-2021 07:24:13
