How to remove rows whose combination of categorical column values occurs three or fewer times in an R data frame?
In data analysis, we sometimes fix the size of the data or the sample size by our own judgment, and this can mean dropping part of the data. One such case is removing rows whose combination of values in the categorical columns appears three or fewer times. This can be done with the filter function of the dplyr package after grouping with the group_by function: filter(n() >= 4) keeps only the groups that contain at least four rows.
Example1
Consider the following data frame:
set.seed(121)
x1 <- sample(LETTERS[1:6], 20, replace = TRUE)
x2 <- sample(c("Male", "Female"), 20, replace = TRUE)
x3 <- rpois(20, 5)
df1 <- data.frame(x1, x2, x3)
df1
Output
   x1     x2 x3
1   D Female  5
2   D Female  2
3   D   Male  7
4   D Female  8
5   A   Male  6
6   C Female  7
7   A Female  3
8   C Female  1
9   C Female  7
10  E   Male  2
11  D Female  3
12  E Female  6
13  F Female  3
14  D Female  4
15  A   Male  4
16  E   Male  4
17  B Female  8
18  B Female  7
19  C Female  5
20  A Female  9
Load the dplyr package and remove the rows whose combination of x1 and x2 occurs three or fewer times:
Example
library(dplyr)
df1 %>% group_by(x1, x2) %>% filter(n() >= 4)
Output
# A tibble: 9 x 3
# Groups:   x1, x2 [2]
  x1    x2        x3
  <chr> <chr>  <int>
1 D     Female     5
2 D     Female     2
3 D     Female     8
4 C     Female     7
5 C     Female     1
6 C     Female     7
7 D     Female     3
8 D     Female     4
9 C     Female     5
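To see why only some groups survive the filter, it can help to tabulate the size of each combination first. The following is a minimal sketch on a small hand-made data frame. The names toy, g1, g2, val, sizes and kept are hypothetical and are not part of the example above. It uses dplyr's count function to list the group sizes and then applies the same group_by and filter steps.

```r
library(dplyr)

# Hypothetical toy data (not df1): the combination A/x occurs 4 times,
# B/y occurs 2 times, and C/x occurs once
toy <- data.frame(
  g1  = c("A", "A", "A", "A", "B", "B", "C"),
  g2  = c("x", "x", "x", "x", "y", "y", "x"),
  val = 1:7
)

# Number of rows in each (g1, g2) combination, stored in column n
sizes <- count(toy, g1, g2)
sizes

# Keep only the rows whose combination occurs at least four times,
# then drop the grouping so later steps work on a plain table
kept <- toy %>%
  group_by(g1, g2) %>%
  filter(n() >= 4) %>%
  ungroup()
kept
```

In this sketch only the four A/x rows remain, because that is the only combination with four or more rows. The final ungroup() is optional and only removes the grouping attribute from the result.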
Example2
y1 <- sample(c("S1", "S2", "S3", "S4", "S5", "S6"), 20, replace = TRUE)
y2 <- sample(c("Winter", "Summer"), 20, replace = TRUE)
y3 <- rnorm(20, 3)
df2 <- data.frame(y1, y2, y3)
df2
Output
   y1     y2       y3
1  S1 Winter 2.683082
2  S4 Summer 1.141916
3  S6 Winter 3.371681
4  S2 Winter 3.191187
5  S3 Summer 2.195504
6  S5 Summer 2.631736
7  S3 Winter 3.303605
8  S6 Summer 3.074344
9  S5 Summer 2.663724
10 S5 Winter 2.281991
11 S6 Summer 4.174418
12 S4 Winter 6.081246
13 S4 Summer 3.202913
14 S2 Winter 5.557243
15 S2 Winter 3.747462
16 S2 Winter 2.621571
17 S2 Summer 3.909743
18 S5 Winter 2.325663
19 S5 Summer 3.749852
20 S5 Winter 2.331191
Example
df2 %>% group_by(y1, y2) %>% filter(n() >= 4)
Output
# A tibble: 4 x 3
# Groups:   y1, y2 [1]
  y1    y2        y3
  <chr> <chr>  <dbl>
1 S2    Winter  3.19
2 S2    Winter  5.56
3 S2    Winter  3.75
4 S2    Winter  2.62
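If dplyr is not available, the same filtering can be approximated in base R. The sketch below is an alternative to the dplyr approach used in this article, not a replacement for it. It uses the base function ave to compute, for every row, how many rows share its combination of categorical values. The names toy, g1, g2, val, group_size and kept_base are hypothetical.

```r
# Hypothetical toy data: A/x occurs 4 times, B/y twice, C/x once
toy <- data.frame(
  g1  = c("A", "A", "A", "A", "B", "B", "C"),
  g2  = c("x", "x", "x", "x", "y", "y", "x"),
  val = 1:7
)

# For each row, the number of rows that share its (g1, g2) combination
group_size <- ave(seq_len(nrow(toy)), toy$g1, toy$g2, FUN = length)

# Keep the rows whose combination occurs at least four times
kept_base <- toy[group_size >= 4, ]
kept_base
```

Unlike the dplyr version, this returns an ordinary data frame with the original row names, and the rows stay in their original order.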