Differentiate between categorical and numerical independent variables in R.


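For a categorical independent variable, R treats each level other than the baseline as a separate term in the model, with its own coefficient; such a variable is declared with the factor function. A numerical independent variable, on the other hand, is either continuous or discrete and enters the model as a single term with one slope.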
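As a minimal sketch of this distinction, the snippet below builds the design matrix that a model would use for the same values, first kept numeric and then wrapped in factor. The vector x and the names num_design, xf and cat_design are hypothetical and used only for illustration; model.matrix is the base R function that exposes the columns a formula expands to.

```r
# Hypothetical toy values, chosen only for illustration
x <- c(0, 1, 2, 1, 0, 2)

# Numeric predictor: a single column, so a model estimates one slope
num_design <- model.matrix(~ x)
colnames(num_design)   # "(Intercept)" "x"

# Categorical predictor: factor() turns each distinct value into a level
xf <- factor(x)
levels(xf)             # "0" "1" "2"

# Every level except the baseline ("0") gets its own 0/1 indicator column
cat_design <- model.matrix(~ xf)
colnames(cat_design)   # "(Intercept)" "xf1" "xf2"
```

The numeric version contributes one column regardless of how many distinct values x takes, while the factor version contributes one column per level beyond the first.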

The example below compares linear regression model summaries to show the difference between numerical and categorical independent variables.

Example

The following snippet creates a sample data frame −

x<-rpois(20,2)
y<-rpois(20,5)
df<-data.frame(x,y)
df

The following data frame is created (the values are drawn at random, so they differ from run to run) −

   x y
1  1 1
2  4 5
3  3 10
4  3 4
5  1 6
6  3 4
7  1 2
8  1 10
9  1 6
10 2 5
11 1 2
12 3 4
13 0 5
14 1 5
15 4 5
16 4 7
17 3 5
18 2 4
19 1 3
20 2 6
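Because rpois draws random values, the data frame above cannot be regenerated exactly by rerunning the snippet. A minimal sketch of making such a draw repeatable is shown below; the seed value 123 is an arbitrary assumption, and since the seed behind the data frame above is not known, this will not reproduce those particular values.

```r
# The seed value 123 is an arbitrary choice for illustration
set.seed(123)
x1 <- rpois(20, 2)
y1 <- rpois(20, 5)

# Resetting the same seed replays the same sequence of random draws
set.seed(123)
x2 <- rpois(20, 2)
y2 <- rpois(20, 5)

identical(x1, x2)  # TRUE
identical(y1, y2)  # TRUE
```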

To fit a linear model to the data in df and view the model summary, add the following code to the above snippet −

# Fit y on x, treating x as a numeric predictor.
# df is reused as created above; calling rpois() again would draw new data.
Model_1<-lm(y~x,data=df)
summary(Model_1)

Output

If you execute all the above given snippets as a single program, it generates the following Output −

Call:
lm(formula = y ~ x, data = df)

Residuals:
   Min     1Q Median     3Q    Max
-3.549 -1.313 -0.503  1.128  5.451

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    4.168      1.013    4.11  0.00065 ***
x              0.382      0.426    0.90  0.38249
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.29 on 18 degrees of freedom
Multiple R-squared: 0.0426, Adjusted R-squared: -0.0106
F-statistic: 0.801 on 1 and 18 DF, p-value: 0.382

To fit a linear model with x as a factor variable and view the model summary, add the following code to the above snippet −

# Fit y on x, treating x as a categorical predictor.
# df is reused as created above; calling rpois() again would draw new data.
Model_2<-lm(y~factor(x),data=df)
summary(Model_2)

Output

If you execute all the above given snippets as a single program, it generates the following Output −

Call:
lm(formula = y ~ factor(x), data = df)

Residuals:
   Min     1Q Median     3Q    Max
-3.375 -1.400 -0.533  1.083  5.625

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.00e+00   2.50e+00    2.00    0.064 .
factor(x)1  -6.25e-01   2.65e+00   -0.24    0.817
factor(x)2  -3.92e-15   2.89e+00    0.00    1.000
factor(x)3   4.00e-01   2.74e+00    0.15    0.886
factor(x)4   6.67e-01   2.89e+00    0.23    0.820
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.5 on 15 degrees of freedom
Multiple R-squared: 0.0526, Adjusted R-squared: -0.2
F-statistic: 0.208 on 4 and 15 DF, p-value: 0.93
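To make the comparison concrete, the sketch below rebuilds the data frame from the values printed earlier (typed in by hand, not redrawn with rpois) and fits both models. The name group_means is a helper introduced here for illustration. It suggests one way to read the two summaries: the numeric model estimates a single slope for x, while the factor model estimates one coefficient per level other than the baseline level 0, each equal to the difference between that level's mean of y and the baseline mean.

```r
# Values copied from the data frame printed earlier in this article
x <- c(1, 4, 3, 3, 1, 3, 1, 1, 1, 2, 1, 3, 0, 1, 4, 4, 3, 2, 1, 2)
y <- c(1, 5, 10, 4, 6, 4, 2, 10, 6, 5, 2, 4, 5, 5, 5, 7, 5, 4, 3, 6)
df <- data.frame(x, y)

Model_1 <- lm(y ~ x, data = df)          # x as a numeric predictor
Model_2 <- lm(y ~ factor(x), data = df)  # x as a categorical predictor

# Numeric: intercept plus one slope. Factor: intercept plus four level terms.
length(coef(Model_1))  # 2
length(coef(Model_2))  # 5

# Mean of y within each level of x
group_means <- tapply(df$y, df$x, mean)

# Differences from the baseline level "0" match the factor(x) coefficients
group_means - group_means["0"]
coef(Model_2)
```

This is also why the residual degrees of freedom drop from 18 to 15 between the two summaries: the factor model uses three more parameters than the numeric one.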
raja
Published on 03-Nov-2021 08:02:54