How to remove rows whose combination of categorical column values occurs three or fewer times in an R data frame?


In data analysis, we sometimes reduce the sample by dropping rows that belong to sparsely populated groups. One such case is removing every row whose combination of categorical column values occurs three or fewer times in the data frame. This can be done with the dplyr package by grouping on the categorical columns with group_by() and then keeping only the groups that have at least four rows with filter(n() >= 4).
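The idea can be sketched on a small hand-made data frame where the group sizes are easy to count by eye. The data frame toy and its columns grp, sex and val are hypothetical names used only for this illustration; the final ungroup() is optional and simply returns an ungrouped result.

```r
library(dplyr)

# Hypothetical toy data: (A, Male) occurs 4 times,
# (B, Female) 3 times and (A, Female) once
toy <- data.frame(
  grp = c("A", "A", "A", "A", "B", "B", "B", "A"),
  sex = c("Male", "Male", "Male", "Male", "Female", "Female", "Female", "Female"),
  val = 1:8
)

# n() is the number of rows in the current (grp, sex) group,
# so only groups with four or more rows survive the filter
kept <- toy %>%
  group_by(grp, sex) %>%
  filter(n() >= 4) %>%
  ungroup()

kept
```

Only the four (A, Male) rows remain, because every other combination occurs three or fewer times.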

Example 1

Consider the data frame below:

set.seed(121)
x1 <- sample(LETTERS[1:6], 20, replace = TRUE)
x2 <- sample(c("Male", "Female"), 20, replace = TRUE)
x3 <- rpois(20, 5)
df1 <- data.frame(x1, x2, x3)
df1

Output

x1 x2 x3
1 D Female 5
2 D Female 2
3 D Male 7
4 D Female 8
5 A Male 6
6 C Female 7
7 A Female 3
8 C Female 1
9 C Female 7
10 E Male 2
11 D Female 3
12 E Female 6
13 F Female 3
14 D Female 4
15 A Male 4
16 E Male 4
17 B Female 8
18 B Female 7
19 C Female 5
20 A Female 9

Loading the dplyr package and removing the rows whose combination of x1 and x2 occurs three or fewer times:

Example

library(dplyr)
df1 %>% group_by(x1, x2) %>% filter(n() >= 4)

Output

# A tibble: 9 x 3
# Groups: x1, x2 [2]
x1 x2 x3
<chr> <chr> <int>
1 D Female 5
2 D Female 2
3 D Female 8
4 C Female 7
5 C Female 1
6 C Female 7
7 D Female 3
8 D Female 4
9 C Female 5
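To see why only these rows survive, it can help to tabulate how often each combination occurs before filtering. The sketch below rebuilds df1 with the same seed and uses dplyr's count() and add_count(); the names combo_counts and df1_kept are introduced here for illustration only. The sampled values can differ between R versions, so no output is shown.

```r
library(dplyr)

set.seed(121)
x1 <- sample(LETTERS[1:6], 20, replace = TRUE)
x2 <- sample(c("Male", "Female"), 20, replace = TRUE)
x3 <- rpois(20, 5)
df1 <- data.frame(x1, x2, x3)

# One row per (x1, x2) combination with its frequency n, largest first
combo_counts <- df1 %>% count(x1, x2, sort = TRUE)
combo_counts

# add_count() appends the group size n to every row,
# so the same filter works without an explicit group_by()
df1_kept <- df1 %>%
  add_count(x1, x2) %>%
  filter(n >= 4) %>%
  select(-n)
df1_kept
```

Unlike the group_by() version, this variant returns an ungrouped data frame, which may be more convenient when further summaries should run over all remaining rows.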

Example 2

y1 <- sample(c("S1", "S2", "S3", "S4", "S5", "S6"), 20, replace = TRUE)
y2 <- sample(c("Winter", "Summer"), 20, replace = TRUE)
y3 <- rnorm(20, 3)
df2 <- data.frame(y1, y2, y3)
df2

Output

y1 y2 y3
1 S1 Winter 2.683082
2 S4 Summer 1.141916
3 S6 Winter 3.371681
4 S2 Winter 3.191187
5 S3 Summer 2.195504
6 S5 Summer 2.631736
7 S3 Winter 3.303605
8 S6 Summer 3.074344
9 S5 Summer 2.663724
10 S5 Winter 2.281991
11 S6 Summer 4.174418
12 S4 Winter 6.081246
13 S4 Summer 3.202913
14 S2 Winter 5.557243
15 S2 Winter 3.747462
16 S2 Winter 2.621571
17 S2 Summer 3.909743
18 S5 Winter 2.325663
19 S5 Summer 3.749852
20 S5 Winter 2.331191

Example

df2 %>% group_by(y1, y2) %>% filter(n() >= 4)

Output

# A tibble: 4 x 3
# Groups: y1, y2 [1]
y1 y2 y3
<chr> <chr> <dbl>
1 S2 Winter 3.19
2 S2 Winter 5.56
3 S2 Winter 3.75
4 S2 Winter 2.62
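If dplyr is not available, the same filtering can be approximated in base R. This is an alternative technique, not the dplyr approach shown above: ave() computes the size of each row's group, and logical indexing keeps the rows from large enough groups. The data frame df and the names group_size and base_kept are hypothetical, chosen only for this sketch.

```r
# Hypothetical data: (S2, Winter) occurs 4 times,
# (S5, Summer) 3 times and (S1, Winter) once
df <- data.frame(
  y1 = c("S2", "S2", "S2", "S2", "S5", "S5", "S5", "S1"),
  y2 = c("Winter", "Winter", "Winter", "Winter", "Summer", "Summer", "Summer", "Winter"),
  y3 = c(3.2, 5.6, 3.7, 2.6, 2.6, 2.7, 3.7, 2.7)
)

# For every row, ave() returns the number of rows sharing its (y1, y2) pair
group_size <- ave(seq_len(nrow(df)), df$y1, df$y2, FUN = length)

# Keep only the rows whose combination occurs at least four times
base_kept <- df[group_size >= 4, ]
base_kept
```

Here only the four (S2, Winter) rows are kept, and the original row names are preserved.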

Updated on: 08-Feb-2021
