How to combine the levels of a factor variable in an R data frame?



An R data frame can hold both numeric and factor variables. Occasionally, raw data records the same category under different labels, such as synonyms or words from another language. For example, a factor variable might have hot and cold as levels, but a Hindi speaker may have recorded hot as garam, the Hindi word for hot. In such cases, we need to combine the equivalent levels into one so that the variable does not carry unnecessary factor levels.

Example

Consider the data frame below −

set.seed(109)
x1<-rep(c("Sweet","Meetha","Bitter","Salty"),times=5)
x2<-sample(1:100,20)
x3<-rpois(20,5)
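# stringsAsFactors=TRUE stores x1 as a factor (the default is FALSE since R 4.0.0)
df1<-data.frame(x1,x2,x3,stringsAsFactors=TRUE)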
df1

Output

       x1 x2 x3
1   Sweet  8  4
2  Meetha 22  6
3  Bitter 25  3
4   Salty 85 10
5   Sweet 90 13
6  Meetha 10  0
7  Bitter 55  7
8   Salty 92  7
9   Sweet 95  4
10 Meetha 31  4
11 Bitter  5  4
12  Salty 56  6
13  Sweet 32  4
14 Meetha 78  6
15 Bitter 16 10
16  Salty 48  9
17  Sweet 49  4
18 Meetha 35  4
19 Bitter 37  9
20  Salty 11  8
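
The levels() approach used in this article only applies to factor columns, and whether a text column is a factor depends on how the data frame was built: since R 4.0.0, data.frame() keeps strings as character vectors unless stringsAsFactors=TRUE is passed. The following is a minimal sketch of checking and converting a column. The data frame d is a hypothetical stand-in and not part of the example above.

```r
# d is a hypothetical data frame used only for this check
d <- data.frame(x1 = rep(c("Sweet", "Meetha", "Bitter", "Salty"), times = 5))

# With default settings this is FALSE on R >= 4.0.0 and TRUE on older versions
is.factor(d$x1)

# factor() converts a character column to a factor; on an existing factor
# it keeps the same values
d$x1 <- factor(d$x1)

is.factor(d$x1)
levels(d$x1)
```

If is.factor() returns FALSE for a column, converting it first avoids levels() returning NULL.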

Since Meetha is the Hindi word for Sweet, we might want to merge the Meetha level into Sweet, which can be done as shown below −

Example

# Rename the Meetha level to Sweet; R merges the two identical levels into one
levels(df1$x1)[levels(df1$x1)=="Meetha"] <- "Sweet"
df1

Output

       x1 x2 x3
1   Sweet  8  4
2   Sweet 22  6
3  Bitter 25  3
4   Salty 85 10
5   Sweet 90 13
6   Sweet 10  0
7  Bitter 55  7
8   Salty 92  7
9   Sweet 95  4
10  Sweet 31  4
11 Bitter  5  4
12  Salty 56  6
13  Sweet 32  4
14  Sweet 78  6
15 Bitter 16 10
16  Salty 48  9
17  Sweet 49  4
18  Sweet 35  4
19 Bitter 37  9
20  Salty 11  8
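
Printing the whole data frame is one way to see the result, but inspecting the levels and their counts is usually quicker. The following is a small sketch on a stand-alone factor; the name taste is hypothetical and simply mirrors the x1 column.

```r
# taste is a hypothetical factor with the same values as the x1 column
taste <- factor(rep(c("Sweet", "Meetha", "Bitter", "Salty"), times = 5))

levels(taste)   # levels before merging, Meetha included
table(taste)    # number of rows per level before merging

# Same technique as above: rename Meetha to Sweet so the two levels merge
levels(taste)[levels(taste) == "Meetha"] <- "Sweet"

levels(taste)   # Meetha is no longer listed
table(taste)    # Sweet now also counts the rows that were Meetha
```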

Let’s have a look at another example −

Example

ID <-1:20
Class<-rep(c("First","Second","Third","Fourth","One"),each=4)
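# stringsAsFactors=TRUE stores Class as a factor (the default is FALSE since R 4.0.0)
df2<-data.frame(ID,Class,stringsAsFactors=TRUE)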
df2

Output

   ID  Class
1   1  First
2   2  First
3   3  First
4   4  First
5   5 Second
6   6 Second
7   7 Second
8   8 Second
9   9  Third
10 10  Third
11 11  Third
12 12  Third
13 13 Fourth
14 14 Fourth
15 15 Fourth
16 16 Fourth
17 17    One
18 18    One
19 19    One
20 20    One

Here One is just another label for First, so we merge it into First as shown below −

Example

# Rename the One level to First; R merges the two identical levels into one
levels(df2$Class)[levels(df2$Class)=="One"] <- "First"
df2

Output

   ID  Class
1   1  First
2   2  First
3   3  First
4   4  First
5   5 Second
6   6 Second
7   7 Second
8   8 Second
9   9  Third
10 10  Third
11 11  Third
12 12  Third
13 13 Fourth
14 14 Fourth
15 15 Fourth
16 16 Fourth
17 17  First
18 18  First
19 19  First
20 20  First
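
When more than one synonym needs to be folded into the same level, the comparison with == can be swapped for %in%. The sketch below assumes a hypothetical factor grade in which both One and 1st stand for First; the label 1st is invented for illustration and does not appear in the data above.

```r
# grade is a hypothetical factor; "1st" is an assumed extra synonym for "First"
grade <- factor(rep(c("First", "Second", "Third", "Fourth", "One", "1st"), each = 2))

# Every label in this vector should end up as "First"
synonyms <- c("One", "1st")

# %in% selects all matching levels at once; giving them the same name merges them
levels(grade)[levels(grade) %in% synonyms] <- "First"

levels(grade)
table(grade)
```

Levels that are not listed in synonyms are left untouched, so the same line can be reused as new spellings turn up in the data.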
