Tags: r, testing, random, p-value

Am I understanding something wrong about randomization?


I thought randomization 'equalizes all factors (whether observed or not)' between the treatment group and control group.

To test this, I ran the code below and found that in more than half of the cases randomization did not appear to work well: at least one variable was statistically significantly different between the randomly split treatment and control groups.


library(dplyr)  # for mutate()

set.seed(1234)

# pre-allocate the p-value vectors before the loop
a_pval<-b_pval<-c_pval<-d_pval<-e_pval<-numeric(1000)

for (i in 1:1000){
  
  ind<-sample(2, 10000, replace=TRUE, prob=c(0.5, 0.5))
  
  a<-as.matrix(rnorm(10000, mean=0, sd=1))
  b<-as.matrix(rnorm(10000, mean=0.5, sd=1)) 
  c<-as.matrix(rnorm(10000, mean=1, sd=2))
  dt<-data.frame(cbind(a,b,c))
  dt$X4 <- dt$X1 + dt$X2
  dt$X5 <- dt$X1 * dt$X3
  
  dt1<-dt[ind==1,]
  dt2<-dt[ind==2,]
  
  a_pval[i]<-t.test(dt1[1,], dt2[1,])$p.value
  b_pval[i]<-t.test(dt1[2,], dt2[2,])$p.value
  c_pval[i]<-t.test(dt1[3,], dt2[3,])$p.value
  d_pval[i]<-t.test(dt1[4,], dt2[4,])$p.value
  e_pval[i]<-t.test(dt1[5,], dt2[5,])$p.value
}

pval<-data.frame(cbind(a_pval,b_pval,c_pval,d_pval,e_pval))

pval<-mutate(pval, adiff = ifelse(a_pval<0.05, 1,0))
pval<-mutate(pval, bdiff = ifelse(b_pval<0.05, 1,0))
pval<-mutate(pval, cdiff = ifelse(c_pval<0.05, 1,0))
pval<-mutate(pval, ddiff = ifelse(d_pval<0.05, 1,0))
pval<-mutate(pval, ediff = ifelse(e_pval<0.05, 1,0))
pval$diff<-pval$adiff+pval$bdiff+pval$cdiff+pval$ddiff+pval$ediff

table(pval$diff)

length(which(a_pval<0.05))
length(which(b_pval<0.05))
length(which(c_pval<0.05))
length(which(d_pval<0.05))
length(which(e_pval<0.05))

Is there something wrong with my code?


Solution

  • I don't think the tests are doing what you think they're doing. Your t tests are operating on rows of your data frame, not columns: dt1[1,] selects the first row (one value from each of the five variables), so each test compares a mixture of different distributions rather than a single normal variable. Change the t test lines to

    a_pval[i]<-t.test(dt1[,1], dt2[,1])$p.value
    b_pval[i]<-t.test(dt1[,2], dt2[,2])$p.value
    c_pval[i]<-t.test(dt1[,3], dt2[,3])$p.value
    d_pval[i]<-t.test(dt1[,4], dt2[,4])$p.value
    e_pval[i]<-t.test(dt1[,5], dt2[,5])$p.value
    

    and that will be fixed, and you'll see that about 5% of your p-values are less than 0.05, as expected.
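    A quick way to convince yourself of that 5% rate, independent of the original data: repeatedly split a single standard-normal variable at random and run the corrected column-wise test (a minimal sketch):

    ```r
    # Randomly split one N(0, 1) variable into two groups 1000 times
    # and record the two-sample t-test p-value for each split.
    set.seed(1)
    pvals <- replicate(1000, {
      ind <- sample(2, 10000, replace = TRUE)
      x <- rnorm(10000)
      t.test(x[ind == 1], x[ind == 2])$p.value
    })

    # Proportion of "significant" results; should be close to 0.05
    mean(pvals < 0.05)
    ```

    Randomization balances the groups in expectation, but any single test still rejects at its nominal rate by construction.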

    I honestly don't understand what you were expecting to see in the pval$diff table. Since columns 4 and 5 are constructed from the first 3 columns, the five p-values are dependent, so you shouldn't expect the counts to follow the binomial distribution you would get from independent tests.
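    For reference, if the five tests were independent, the number of variables flagged per replication would follow a Binomial(5, 0.05) distribution, and the chance of at least one false positive per replication would be 1 - 0.95^5, about 0.226. A minimal sketch of those baseline values:

    ```r
    # Expected distribution of the "diff" count if the 5 tests were independent
    round(dbinom(0:5, size = 5, prob = 0.05), 4)

    # Probability of at least one p-value < 0.05 out of 5 independent tests
    1 - 0.95^5
    ```

    Because X4 and X5 are built from X1, X2, and X3, the tests are correlated and the observed counts will deviate from these values.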