Search code examples
rstatisticsbernoulli-probability

Is this R code of Rao score test for the Bernoulli data model correct?


I am a complete statistical noob and new to R, hence the question. I've tried to find an implementation of the Rao score for the particular case when one's data is binary and each observation has bernoulli distribution. I stumbled upon anova in the R language but failed to understand how to use that. Therefore, I tried implementing Rao score for this particular case myself:

rao.score.bern <- function(data, p0) {
  # assume `data` is a list of 0s and 1s
  y <- sum(data)
  n <- length(data)
  phat <- y / n

  z <- (phat - p0) / sqrt(p0 * (1 - p0) / n)
  p.value <- 2 * (1 - pnorm(abs(z)))
}

I am pretty sure that there is a bug in my code because it produces only two distinct p-values in the following scenario:

p0 <- 1 / 4
p <- seq(from=0.01, to=0.5, by=0.01)
n <- seq(from=5, to=70, by=1)
g <- expand.grid(n, p)

data <- apply(g, 1, function(x) rbinom(x[1], 1, x[2]))
p.values <- sapply(data, function(x) rao.score.bern(x[[1]], p0))

Could someone please show me where the problem is? Could you perhaps point me to a built-in solution in R?


Solution

  • First test, then debug.

    Test

    Does rao.score.bern work at all?

    rao.score.bern(c(0,0,0,1,1,1), 1/6))

    This returns...nothing! Fix it by replacing the ultimate line by

    2 * (1 - pnorm(abs(z)))
    

    This eliminates the unnecessary assignment.

    rao.score.bern(c(0,0,0,1,1,1), 1/6))

    [1] 0.02845974
    

    OK, now we're getting somewhere.

    Debug

    Unfortunately, the code still doesn't work. Let's debug by yanking the call to rao.score.bern and replacing it by something that shows us the input. Don't apply it to the large input you created! Use a small piece of it:

    sapply(data[1:5], function(x) x[[1]])

    [1] 0 0 0 0 0
    

    That's not what you expected, is it? It's returning just one zero for each element of data. What about this?

    sapply(data[1:5], function(x) x)

    [[1]]
    [1] 0 0 0 0 0
    [[2]]
    [1] 0 0 0 0 0 0 
    ...
    [[5]]
    [1] 0 0 0 0 0 0 0 0 0
    

    Much better! The variable x in the call to sapply refers to the entire vector, which is what you want to pass to your routine. Whence

    p.values <- sapply(data, function(x) rao.score.bern(x, p0)); hist(p.values)

    Figure