
R's code to obtain a histogram following a chi-square distribution from uniform random numbers


My textbook (written in Japanese) contained code that generates a chi-square distribution with 3 degrees of freedom from a uniform distribution. I modified it to produce a histogram that follows a chi-square distribution with 4 degrees of freedom. The histogram agrees well with R's built-in chi-square density (dchisq), so I think the code is probably correct (see Box1 below).

I then tried to generalize Box1's code to produce a histogram following a chi-square distribution with a specified number of degrees of freedom, but the result contains many errors and does not work (see Box2).

My Question:

The Box2 code, which is meant to generate a chi-square distribution from a uniform distribution, does not work correctly.
Please help me fix the errors in the Box2 code.

I suspect that my generalization of "y<-ifelse(s<0.2,1,ifelse(s<0.4,2,ifelse(s<0.6,3,ifelse(s<0.8,4,5))))" is the part that fails in Box2.

Box1: Code for obtaining a histogram that follows a chi-square distribution with 4 degrees of freedom (probably works correctly)

ite <- 10000
sc <- numeric(ite) #★1
A<- c(20,20,20,20,20) #★2
for(i in 1:ite){
  
  s<- runif(sum(A)*5) #★3
  y<-ifelse(s<0.2,1,ifelse(s<0.4,2,ifelse(s<0.6,3,ifelse(s<0.8,4,5))))  #★4
  z1 <- table(y)
  z2 <- A*5
  z3 <- (z1-z2)^2 /z2
  sc[i] <- sum(z3)
}

hist(sc,ylim=c(0,0.35),breaks="Scott",freq=F)
curve(dchisq(x,4),add=T)

The code in Box1 is based on the following fact: if 500 (= sum(A)*5) uniform random numbers are divided among five rooms of equal size, the expected number falling into each room is 100. Here the 1st room, 2nd room, ..., and 5th room are defined by 0≦s<0.2, 0.2≦s<0.4, ..., and 0.8≦s≦1. This can be seen from the output of table(y) in Box1' below. Of course, sum(table(y)) in Box1' is always 500.

Box1': Logic for converting the uniform random numbers (s) into step values (y) in Box1's code

A<- c(20,20,20,20,20)
s<- runif(sum(A)*5) #★3
y<-ifelse(s<0.2,1,ifelse(s<0.4,2,ifelse(s<0.6,3,ifelse(s<0.8,4,5))))
table(y)
sum(table(y))
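
As a rough check of the expected counts described above, the following sketch averages the five room counts over repeated draws. It uses findInterval() and tabulate() in place of the nested ifelse() (a substitution made here for brevity, not part of the original code); n_rep, counts and mean_counts are names chosen only for this illustration.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
A <- c(20, 20, 20, 20, 20)
n <- sum(A) * 5                  # 500 uniform numbers per draw
n_rep <- 2000                    # number of repeated draws (illustrative)

counts <- replicate(n_rep, {
  s <- runif(n)
  # findInterval() returns 1..5 for the rooms [0,0.2), [0.2,0.4), ..., [0.8,1)
  room <- findInterval(s, c(0, 0.2, 0.4, 0.6, 0.8))
  tabulate(room, nbins = 5)      # counts per room, empty rooms kept as 0
})

mean_counts <- rowMeans(counts)  # each entry should be close to 100
print(round(mean_counts, 1))
```

Each column of counts sums to 500, and the row means should settle near 100 as n_rep grows.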

Box2: Code for obtaining a histogram following a chi-square distribution with n degrees of freedom (contains many errors)

chiq_dist_n<-function(numb,itr){
  A<-numeric(numb) #★2
  aa<-numeric(numb) #★4-1
  for(i in 1:numb){
    A[i]=20
  } #★2

  ntot=sum(A) 
  for(i in 1:numb){
    if (i ==1){aa[i]= A[i]/ntot                 
    }else{
      aa[i]=aa[i-1]+(A[i]/ntot) 
    }
  } #★4-2
 
  sc<-numeric(itr) #★1
  y<-numeric(ntot*numb) #★4-3
  
for(i in 1:itr){
    x<-runif(ntot*numb)
  for(k in 1:ntot*numb){
    for(j in 1:numb){
      if (x[k]<aa[numb-j+1]) {                 
        y[k]<-j               
      } else {}
    }    
}#★3
        
    z1<-table(y)
    z2<-A*ntot 
    z3<-(z1-z2)^2/z2
    sum(z3)
    sc[i]<-sum(z3)

  }
  return(sc)  
}

hist(chiq_dist(10,1000),ylim=c(0,0.35),breaks="Scott",freq=F)

The part of the Box2 code that generates y is extracted in Box2'. Looking at table(y) in Box2', you can see that far too many elements of y are zero. I want the output of table(y) in Box2' to be roughly the same as the output of table(y) in Box1'.

Box2': Logic for converting the uniform random numbers (x) into step values (y) in Box2's code

A<- c(20,20,20,20,20)
ntot=sum(A)
numb=length(A)

aa<-numeric(numb)
for(i in 1:numb){
  if (i ==1){aa[i]= A[i]/ntot                 
  }else{
    aa[i]=aa[i-1]+(A[i]/ntot) 
  }
} #★4-2

y<-numeric(ntot*numb)
  x<-runif(ntot*numb)
  
  for(k in 1:ntot*numb){
    for(j in 1:numb){
      if (x[k]<aa[numb-j+1]) {                 
        y[k]<-j
      } else {}
    }
  }#★3

table(y)


Solution

  • You don't need ifelse() to bin uniform random numbers; you can just use cut() and specify the breaks, for example:

    set.seed(111)
    v = runif(10)
     [1] 0.59298128 0.72648112 0.37042200 0.51492383 0.37766322 0.41833733
     [7] 0.01065785 0.53229524 0.43216062 0.09368152
    
    cut(v,breaks=seq(0,1,length.out=6),labels=1:5)
    
    [1] 3 4 2 3 2 3 1 3 3 1
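
    One practical difference from the nested ifelse() approach, shown here as a small sketch: cut() returns a factor whose levels cover every bin, so table() reports a count of zero for a bin that happens to be empty, and the vector of observed counts always has the same length as the vector of expected counts. The input values and the names by_ifelse and by_cut are made up for illustration.

    ```r
    v <- c(0.05, 0.15, 0.45, 0.95)   # made-up values; bins 2 and 4 stay empty

    # Nested ifelse() gives plain numbers, so unused labels never appear in table()
    by_ifelse <- ifelse(v < 0.2, 1, ifelse(v < 0.4, 2, ifelse(v < 0.6, 3, ifelse(v < 0.8, 4, 5))))

    # cut() gives a factor with all five levels, used or not
    by_cut <- cut(v, breaks = seq(0, 1, length.out = 6), labels = 1:5)

    length(table(by_ifelse))   # 3: empty bins are dropped
    length(table(by_cut))      # 5: empty bins are kept with count 0
    ```

    With 500 draws and a handful of bins an empty bin is very unlikely, but the fixed length avoids silent recycling if the number of bins grows.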
    

    I am not sure what A is for, but to simulate a chi-square statistic I suppose you draw a random sample of the labels 1:(df+1), where df is the degrees of freedom. If we fix the number of samples at 500, the expected count for each bin is 500/(df+1).

    So, without changing too much of your code:

    chiq_dist_n<-function(numb,ite){
    
    sc <- numeric(ite) 
    for(i in 1:ite){
      
      x<- runif(500) #★3
      y<- cut(x,breaks=seq(0,1,length.out=numb+2),labels=1:(numb+1))
      z1 <- table(y)
      z2 <- length(x)/(numb+1)
      z3 <- (z1-z2)^2 /z2
      sc[i] <- sum(z3)
    }
    
    hist(sc,ylim=c(0,0.35),breaks="Scott",freq=F,main=paste0("df=",numb))
    curve(dchisq(x,numb),add=T)
    }
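
    The value stored in sc[i] is Pearson's goodness-of-fit statistic for equal bin probabilities. As a sketch for checking a single iteration, it can be compared with the statistic reported by chisq.test(), which tests equal probabilities by default when given a vector of counts. The seed and the names manual and builtin are arbitrary choices for this example.

    ```r
    set.seed(42)                       # arbitrary seed
    numb <- 4                          # degrees of freedom, so numb + 1 bins
    x <- runif(500)
    y <- cut(x, breaks = seq(0, 1, length.out = numb + 2), labels = 1:(numb + 1))
    z1 <- table(y)                     # observed counts
    z2 <- length(x) / (numb + 1)       # expected count per bin (100 here)

    manual  <- sum((z1 - z2)^2 / z2)
    builtin <- chisq.test(as.vector(z1))   # goodness of fit, equal probabilities by default

    all.equal(manual, unname(builtin$statistic))   # TRUE
    unname(builtin$parameter)                      # 4 degrees of freedom
    ```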
    

    Running it for even degrees of freedom from 2 to 12:

    par(mfrow=c(3,2))
    par(mar=c(2.5,2.5,2.5,2.5))
    for(i in seq(2,12,2)){
        chiq_dist_n(i,10000)
    }
    

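
For completeness, here is a sketch of why Box2' leaves most of y at zero, together with one possible minimal repair of Box2 that keeps the vector A. In R the : operator binds more tightly than *, so 1:ntot*numb is evaluated as (1:ntot)*numb and the loop only visits every numb-th index. The call in Box2 also uses the name chiq_dist, while the function is defined as chiq_dist_n. Two further changes in the sketch are an interpretation of the intent rather than something stated in the question: the expected counts are taken as A*numb (matching A*5 in Box1), and numb bins are treated as giving numb-1 degrees of freedom (Box1 has 5 bins and 4 degrees of freedom). The name chiq_dist_A is hypothetical, and cumsum(), findInterval() and tabulate() replace the explicit loops.

```r
ntot <- 100; numb <- 5
head(1:ntot * numb)      # 5 10 15 20 25 30 -> only every 5th index is visited
head(1:(ntot * numb))    # 1 2 3 4 5 6      -> the intended index range

chiq_dist_A <- function(numb, itr) {   # hypothetical name for the repaired Box2
  A <- rep(20, numb)                   # same A as in Box2
  ntot <- sum(A)
  aa <- cumsum(A) / ntot               # upper bin boundaries, the last one is 1
  z2 <- A * numb                       # expected counts: (ntot * numb) * A / ntot
  sc <- numeric(itr)
  for (i in 1:itr) {
    x <- runif(ntot * numb)
    # bin index 1..numb; the final boundary (1) is dropped so the top bin is open-ended
    y <- findInterval(x, c(0, aa[-numb]))
    z1 <- tabulate(y, nbins = numb)    # observed counts, empty bins kept as 0
    sc[i] <- sum((z1 - z2)^2 / z2)
  }
  sc
}

set.seed(123)                          # arbitrary seed
sc <- chiq_dist_A(5, 2000)
mean(sc)                               # should be close to numb - 1 = 4
```

The returned vector can be plotted as in Box1, for example with hist(sc, breaks="Scott", freq=FALSE) followed by curve(dchisq(x, 5-1), add=TRUE).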