How to subset columns that have fewer than four categories in an R data frame?


A categorical column has at least two categories, and there is no fixed upper limit, although the number of categories that can actually appear is bounded by the number of cases. If a data frame contains categorical columns with varying numbers of categories, we might want to keep only the columns that have fewer than four. This can be useful when an analysis is deliberately restricted to low-cardinality variables or when the data must meet predefined characteristics. Such columns can be selected with the sapply function, which is used to count the unique values in each column, as shown in the examples below.
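The selection can be broken into three steps: count the unique values per column, compare the counts with the threshold, and index the data frame with the resulting logical vector. The following is a minimal sketch of that idea on a small fixed data frame; the names demo, temp, sex, group, n_categories and keep are made up for illustration and do not appear in the examples below.

```r
# Small deterministic data frame (hypothetical illustration data)
demo <- data.frame(
  temp  = c("Hot", "Cold", "Warm", "Hot", "Cold"),
  sex   = c("Male", "Female", "Male", "Male", "Female"),
  group = c("a", "b", "c", "d", "e")
)

# Step 1: count the distinct values in each column
# (sapply returns a named vector: temp = 3, sex = 2, group = 5)
n_categories <- sapply(demo, function(col) length(unique(col)))

# Step 2: turn the counts into a logical mask, one entry per column
keep <- n_categories < 4

# Step 3: use the mask to select columns; only temp and sex remain
result <- demo[, keep]
```

Splitting the one-line expression this way makes it easy to inspect n_categories before deciding on a threshold.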

Example 1

Consider the following data frame −

> x1<-sample(c("Hot","Cold","Warm"),20,replace=TRUE)
> x2<-sample(c("Male","Female"),20,replace=TRUE)
> x3<-sample(letters[1:4],20,replace=TRUE)
> df1<-data.frame(x1,x2,x3)
> df1

Output

     x1     x2 x3
1  Warm   Male  b
2  Cold Female  c
3  Cold   Male  a
4   Hot   Male  d
5   Hot   Male  d
6   Hot Female  a
7   Hot   Male  a
8  Cold Female  d
9  Warm   Male  d
10 Warm Female  d
11 Cold   Male  a
12 Cold Female  c
13  Hot   Male  b
14 Warm   Male  c
15 Cold   Male  b
16 Warm   Male  a
17  Hot   Male  b
18 Cold   Male  b
19  Hot Female  c
20 Warm Female  d

Selecting the columns of df1 that have fewer than four categories −

> df1[,sapply(df1, function(col) length(unique(col)))<4]

Output

     x1     x2
1  Warm   Male
2  Cold Female
3  Cold   Male
4   Hot   Male
5   Hot   Male
6   Hot Female
7   Hot   Male
8  Cold Female
9  Warm   Male
10 Warm Female
11 Cold   Male
12 Cold Female
13  Hot   Male
14 Warm   Male
15 Cold   Male
16 Warm   Male
17  Hot   Male
18 Cold   Male
19  Hot Female
20 Warm Female
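One caveat with this indexing style: when only a single column passes the test, R simplifies the result to a plain vector instead of a data frame. The sketch below illustrates this and shows the drop = FALSE argument of the `[` operator as one way to keep the data frame structure; the names df, mask, as_vector and as_frame are hypothetical.

```r
# Hypothetical data: column a has 2 categories, column b has 4
df <- data.frame(
  a = c("x", "y", "x", "y"),
  b = c("p", "q", "r", "s")
)

# Only column a has fewer than four categories
mask <- sapply(df, function(col) length(unique(col))) < 4

# Default behaviour: a single selected column is simplified to a vector
as_vector <- df[, mask]

# drop = FALSE keeps the result as a one-column data frame
as_frame <- df[, mask, drop = FALSE]
```

Adding drop = FALSE is usually the safer choice when the number of matching columns is not known in advance.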

Example 2

> y1<-sample(c("Male","Female"),20,replace=TRUE)
> y2<-sample(letters[1:5],20,replace=TRUE)
> y3<-sample(c("Asian","American","Chinese"),20,replace=TRUE)
> df2<-data.frame(y1,y2,y3)
> df2

Output

       y1 y2       y3
1    Male  b  Chinese
2  Female  b American
3  Female  d    Asian
4  Female  e American
5  Female  e    Asian
6  Female  c  Chinese
7  Female  a  Chinese
8  Female  a  Chinese
9    Male  d American
10 Female  d  Chinese
11 Female  d  Chinese
12 Female  c American
13 Female  b American
14   Male  d  Chinese
15   Male  a American
16   Male  e    Asian
17   Male  b    Asian
18 Female  d  Chinese
19 Female  d  Chinese
20 Female  c    Asian

Selecting the columns of df2 that have fewer than four categories −

> df2[,sapply(df2, function(col) length(unique(col)))<4]

Output

       y1       y3
1    Male  Chinese
2  Female American
3  Female    Asian
4  Female American
5  Female    Asian
6  Female  Chinese
7  Female  Chinese
8  Female  Chinese
9    Male American
10 Female  Chinese
11 Female  Chinese
12 Female American
13 Female American
14   Male  Chinese
15   Male American
16   Male    Asian
17   Male    Asian
18 Female  Chinese
19 Female  Chinese
20 Female    Asian
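The expression used in both examples counts unique values in every column, so a numeric column with few distinct values would also be selected, and NA would be counted as a category. If that is not wanted, the logic could be wrapped in a small helper along the following lines. This is only a sketch: the function name subset_low_cardinality, its max_categories argument and the example data are assumptions made for illustration.

```r
# Hypothetical helper: keep character/factor columns with fewer than
# max_categories distinct non-missing values
subset_low_cardinality <- function(data, max_categories = 4) {
  # TRUE for columns stored as character or factor
  is_categorical <- vapply(
    data,
    function(col) is.character(col) || is.factor(col),
    logical(1)
  )
  # Number of distinct values per column, ignoring NA
  n_categories <- vapply(
    data,
    function(col) length(unique(col[!is.na(col)])),
    integer(1)
  )
  # drop = FALSE keeps a data frame even if one column is selected
  data[, is_categorical & n_categories < max_categories, drop = FALSE]
}

# Hypothetical data: gender has 2 categories plus an NA, code has 5,
# and score is numeric with only 2 distinct values
example <- data.frame(
  gender = c("Male", "Female", NA, "Male", "Female"),
  code   = c("a", "b", "c", "d", "e"),
  score  = c(1, 1, 2, 2, 2)
)

# Only gender is kept: code has too many categories, score is numeric
result <- subset_low_cardinality(example)
```

vapply is used here instead of sapply because it fixes the type of the result, which makes the helper more predictable on unusual inputs.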

Updated on: 05-Mar-2021
