Search code examples
rstatisticschi-squaredgoodness-of-fit

R How to generate a vector of probabilities normally distributed to be used at chisq.test


I have a vector of 30 samples I want to test the hypothesis of the sample being from a population which is normally distributed.

> N.concentration
  [1] 0.164 0.045 0.069 0.100 0.050 0.080 0.043 0.036 0.057 0.154 0.133 0.193
  [13] 0.129 0.121 0.081 0.178 0.041 0.040 0.116 0.078 0.104 0.095 0.116 0.038
  [25] 0.141 0.100 0.104 0.078 0.121 0.104

I made a frequency vector using hist

> N.hist <- hist(N.concentration, breaks=10)
> N.freq <- N.hist$count
  [1] 3 5 4 4 5 4 2 2 1

I'm using chisq.test to check the fitness of N.freq to a normal distribution, however, the function requires an argument p = a vector of probabilities of the same length of x, as defined in chisq.test documentation. I'm trying to generate a vector to it but, honestly, I don't know exactly what to generate. I'm trying

> d <- length(N.freq$count)%/%2
> p <- dnorm(c(-d:d))
> p
  [1] 0.0001338302 0.0044318484 0.0539909665 0.2419707245 0.3989422804
  [6] 0.2419707245 0.0539909665 0.0044318484 0.0001338302
> chisq.test(N.freq, p = p)
   Error in chisq.test(p1$count, p = p) : 
   probabilities must sum to 1.

I thought about using rescale.p=TRUE but I'm not sure if this will produce a valid test.


EDIT: If I use rescale.p, I got a warning message

> chisq.test(N.freq, p=p, rescale.p=TRUE)

Chi-squared test for given probabilities

data:  N.freq
X-squared = 2697.7, df = 8, p-value < 2.2e-16

Warning message:
In chisq.test(N.freq, p = p, rescale.p = TRUE) :
Chi-squared approximation may be incorrect

Solution

  • As I said, to test normality we have to know the mean and standard error of the normal distribution in Null Hypothesis. Since there are no given values, we have to estimate them from your 30 data.

    x <- c(0.164, 0.045, 0.069, 0.1, 0.05, 0.08, 0.043, 0.036, 0.057, 
    0.154, 0.133, 0.193, 0.129, 0.121, 0.081, 0.178, 0.041, 0.04, 
    0.116, 0.078, 0.104, 0.095, 0.116, 0.038, 0.141, 0.1, 0.104, 
    0.078, 0.121, 0.104)
    
    mu <- mean(x)
    sig <- sd(x)
    

    Now, as what you have done, we need to bin the data:

    h <- hist(x, breaks = 10)
    #List of 6
    # $ breaks  : num [1:10] 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2
    # $ counts  : int [1:9] 3 5 4 4 5 4 2 2 1
    # $ density : num [1:9] 5 8.33 6.67 6.67 8.33 ...
    # $ mids    : num [1:9] 0.03 0.05 0.07 0.09 0.11 0.13 0.15 0.17 0.19
    # $ xname   : chr "x"
    # $ equidist: logi TRUE
    # - attr(*, "class")= chr "histogram"
    

    To get the true probability under Null Hypothesis, we need probability for each bin cell, i.e., between breaks.

    p <- diff(pnorm(h$breaks, mu, sig))
    #[1] 0.05675523 0.10254734 0.15053351 0.17953337 0.17396679 0.13696059 0.08760419
    #[8] 0.04552387 0.01921839
    

    I tend not to trust chi-square test with only 30 data. But here is how we can use chisq.test:

    chisq.test(h$counts, p = p, rescale.p = TRUE)
    #
    #   Chi-squared test for given probabilities
    #
    #data:  h$counts
    #X-squared = 3.1476, df = 8, p-value = 0.9248
    #
    #Warning message:
    #In chisq.test(h$counts, p, rescale.p = TRUE) :
    #  Chi-squared approximation may be incorrect
    

    Often you need not bother the warning message. If you want to get rid of it, set simulate.p.value = TRUE:

    chisq.test(h$counts, p = p, rescale.p = TRUE, simulate.p.value = TRUE)
    #
    #   Chi-squared test for given probabilities with simulated p-value (based
    #   on 2000 replicates)
    #
    #data:  h$counts
    #X-squared = 3.1476, df = NA, p-value = 0.942