Random sampling is an important part of data analysis. We usually sample rows rather than columns, because rows represent the cases. To draw a random sample of some percentage of the rows that hold a particular value in a column of an R data frame, we can combine the sample function with the which function: which finds the positions of the matching rows, and sample draws from those positions.
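As a minimal sketch of these two building blocks, the snippet below applies which and sample to a small vector. The vector `x` and the names `idx` and `picked` are hypothetical and used only for illustration.

```r
# Hypothetical vector, used only for illustration
x <- c("A", "B", "A", "C", "A", "B")

# which() returns the positions where the condition is TRUE
idx <- which(x == "A")
idx
# [1] 1 3 5

# sample() draws the requested number of positions without replacement
set.seed(1)
picked <- sample(idx, 2)
picked
```

Whatever the seed, `picked` holds two of the positions 1, 3 and 5, so indexing with it only ever returns elements equal to "A".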
Consider the data frame below −
set.seed(887)
grp <- sample(LETTERS[1:4], 20, replace = TRUE)
Score <- sample(101:150, 20)
df1 <- data.frame(grp, Score)
df1
   grp Score
1    D   135
2    D   114
3    C   121
4    C   150
5    B   129
6    A   110
7    D   126
8    D   132
9    C   118
10   D   102
11   B   103
12   D   145
13   A   128
14   C   147
15   B   106
16   B   125
17   D   130
18   B   131
19   A   142
20   C   143
Randomly sampling fifty percent of the rows where column grp is A −
df1[sample(which(df1$grp == "A"), round(0.5 * length(which(df1$grp == "A")))), ]
   grp Score
2    A   138
20   A   125
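The one-line expression above can be easier to follow when split into steps. The sketch below rebuilds df1 from the values printed earlier so that it runs on its own; the intermediate names `rows_A`, `n_A` and `half_A` are assumptions introduced here for readability.

```r
# Rebuild df1 from the values printed above so this block is self-contained
df1 <- data.frame(
  grp = c("D", "D", "C", "C", "B", "A", "D", "D", "C", "D",
          "B", "D", "A", "C", "B", "B", "D", "B", "A", "C"),
  Score = c(135, 114, 121, 150, 129, 110, 126, 132, 118, 102,
            103, 145, 128, 147, 106, 125, 130, 131, 142, 143)
)

# Step 1: positions of the rows where grp is "A"
rows_A <- which(df1$grp == "A")
rows_A
# [1]  6 13 19

# Step 2: number of rows that make up fifty percent, rounded
n_A <- round(0.5 * length(rows_A))   # round(1.5) is 2 in R

# Step 3: draw that many positions at random and subset the data frame
half_A <- df1[sample(rows_A, n_A), ]
half_A
```

The two rows drawn always come from rows 6, 13 and 19, the only rows where grp is A; which two are picked depends on the state of the random number generator.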
Let’s have a look at another example −
y1 <- sample(c("YT1", "YT2", "YT3"), 20, replace = TRUE)
y2 <- rnorm(20, 10, 1)
df2 <- data.frame(y1, y2)
df2
    y1        y2
1  YT2 10.886273
2  YT1  9.534332
3  YT1  8.353436
4  YT1 10.878407
5  YT2  9.881384
6  YT2  9.825197
7  YT3  8.805524
8  YT3 10.189767
9  YT1 11.615293
10 YT1 10.194561
11 YT3 10.317023
12 YT1 11.570260
13 YT1  9.488106
14 YT2 10.340876
15 YT2  7.425779
16 YT2 10.085891
17 YT1 11.023932
18 YT2 10.301987
19 YT3 10.234140
20 YT1  9.048794
Randomly sampling thirty percent of the rows where column y1 is YT1 −
df2[sample(which(df2$y1 == "YT1"), round(0.3 * length(which(df2$y1 == "YT1")))), ]
    y1        y2
2  YT1 10.400617
13 YT1  8.977768
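Since both examples follow the same pattern, it may be convenient to wrap it in a function. The sketch below is one possible way to do that, not a standard R function: the helper `sample_rows_by_value`, its arguments, and the data frame `df_demo` are all hypothetical. It also sidesteps a documented quirk of sample: when its first argument is a single number k, it samples from 1:k, which would return the wrong rows if only one row matched.

```r
# Hypothetical helper: sample a fraction of the rows where `column` equals `value`
sample_rows_by_value <- function(data, column, value, fraction) {
  idx <- which(data[[column]] == value)
  n <- round(fraction * length(idx))
  # Draw positions within idx via sample.int(), so a single matching row
  # is never misread by sample() as the range 1:k
  data[idx[sample.int(length(idx), n)], , drop = FALSE]
}

# Hypothetical data, used only for illustration
df_demo <- data.frame(
  y1 = c("YT1", "YT2", "YT1", "YT3", "YT1", "YT1"),
  y2 = c(10.2, 9.8, 11.1, 8.7, 9.5, 10.6)
)

set.seed(42)
# Fifty percent of the four YT1 rows, so two rows
half_yt1 <- sample_rows_by_value(df_demo, "y1", "YT1", 0.5)
# YT3 occurs once (row 4); the helper returns exactly that row
only_yt3 <- sample_rows_by_value(df_demo, "y1", "YT3", 1)
```

With the plain `sample(which(...), n)` form, the YT3 call would draw from rows 1 to 4 rather than from row 4 alone, which is why the helper samples positions inside `idx` instead.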