How to test for a significant relationship between two categorical columns of an R data frame?


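To test for a significant relationship (association) between two categorical columns of an R data frame, we first build the contingency table from those columns with table and then apply the chi-square test of independence using chisq.test. For example, if we have a data frame called df that contains two categorical columns, say C1 and C2, the test can be run with the command chisq.test(table(df$C1,df$C2)). A small p-value (conventionally below 0.05) suggests that the two columns are not independent. Note that the examples below generate their data with sample, which draws values at random, so your data and test results will differ from the output shown unless you set a seed with set.seed first.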

Example

x1<-sample(LETTERS[1:4],20,replace=TRUE)
y1<-sample(letters[1:4],20,replace=TRUE)
df1<-data.frame(x1,y1)
df1

Output

   x1 y1
1  D  a
2  B  d
3  D  d
4  B  d
5  A  a
6  A  b
7  B  c
8  D  d
9  C  d
10 D  c
11 C  a
12 D  c
13 D  a
14 A  a
15 B  d
16 A  c
17 C  d
18 A  d
19 C  b
20 D  a

Example

table(df1$x1,df1$y1)

Output

   a  b  c  d
A  2  1  1  1
B  0  0  1  3
C  1  1  0  2
D  3  0  2  2

Testing for a significant relationship between columns x1 and y1 of df1:

Example

chisq.test(table(df1$x1,df1$y1))

Output

   Pearson's Chi-squared test
data: table(df1$x1, df1$y1)
X-squared = 7.4464, df = 9, p-value = 0.5907
Warning message:
In chisq.test(table(df1$x1, df1$y1)) :
Chi-squared approximation may be incorrect
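
The warning appears because several expected cell counts are very small (a common rule of thumb asks for expected counts of at least 5), so the chi-squared approximation is unreliable for this table. Here is a minimal sketch of how one might check this and try alternatives. It assumes the contingency table printed above is rebuilt by hand as a matrix; the names tab1 and res1 and the seed value are hypothetical and do not come from the original example.

```r
# Rebuild the 4 x 4 contingency table printed above (tab1 is a hypothetical name)
tab1 <- matrix(c(2, 1, 1, 1,
                 0, 0, 1, 3,
                 1, 1, 0, 2,
                 3, 0, 2, 2),
               nrow = 4, byrow = TRUE,
               dimnames = list(LETTERS[1:4], letters[1:4]))

# Expected counts under independence; values well below 5 trigger the warning
res1 <- suppressWarnings(chisq.test(tab1))
res1$expected

# Alternative 1: Monte Carlo p-value instead of the chi-squared approximation
set.seed(123)  # arbitrary seed, only to make the simulation repeatable
chisq.test(tab1, simulate.p.value = TRUE, B = 10000)

# Alternative 2: Fisher's exact test, which does not rely on large counts
fisher.test(tab1)
```

Both alternatives avoid the large-sample approximation, so they are usually a safer choice when the table is this sparse.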

Example

x2<-sample(c("hot","cold"),20,replace=TRUE)
y2<-sample(c("summer","winter","spring"),20,replace=TRUE)
df2<-data.frame(x2,y2)
df2

Output

    x2    y2
1  cold  winter
2  hot   winter
3  hot   winter
4  hot   spring
5  cold  summer
6  cold  summer
7  cold  spring
8  hot   winter
9  cold  summer
10 hot   spring
11 hot   winter
12 cold  winter
13 cold  winter
14 hot   summer
15 hot   winter
16 hot   summer
17 hot   summer
18 cold  summer
19 cold  spring
20 hot   summer

Example

table(df2$x2,df2$y2)

Output

     spring summer winter
cold      2      4      3
hot       2      4      5

Testing for a significant relationship between columns x2 and y2 of df2:

Example

chisq.test(table(df2$x2,df2$y2))

Output

   Pearson's Chi-squared test
data: table(df2$x2, df2$y2)
X-squared = 0.30303, df = 2, p-value = 0.8594
Warning message:
In chisq.test(table(df2$x2, df2$y2)) :
Chi-squared approximation may be incorrect
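
To act on the result in code rather than reading it off the printout, the p-value can be extracted from the object that chisq.test returns. The following is a short sketch, assuming a conventional 0.05 significance level. The names res2, alpha and p_val are illustrative and not part of the original example, and the df2 values listed above are re-entered by hand so the snippet runs on its own.

```r
# Re-enter the 20 rows of df2 shown above so the result is reproducible
x2 <- c("cold","hot","hot","hot","cold","cold","cold","hot","cold","hot",
        "hot","cold","cold","hot","hot","hot","hot","cold","cold","hot")
y2 <- c("winter","winter","winter","spring","summer","summer","spring",
        "winter","summer","spring","winter","winter","winter","summer",
        "winter","summer","summer","summer","spring","summer")
df2 <- data.frame(x2, y2)

# Run the test; the small-sample warning is suppressed only to keep output tidy
res2 <- suppressWarnings(chisq.test(table(df2$x2, df2$y2)))

alpha <- 0.05          # assumed significance level
p_val <- res2$p.value  # about 0.8594 for this data
if (p_val < alpha) {
  message("Reject independence: x2 and y2 appear to be related")
} else {
  message("No significant relationship detected between x2 and y2")
}
```

With a p-value of about 0.86, this data gives no evidence against independence. The first example, with a p-value of 0.5907, leads to the same conclusion.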

Updated on: 17-Mar-2021
