Search code examples
rloopslapplystatistics-bootstrap

Take one sample per row at a time, compute mean, loop


In R I have a matrix of 47 rows and 30 columns. Each cell contains a numerical value (varying from 0.0 to 1.0). Some cells have "NA" instead of a numerical value.

This is what I would like to do:

  1. For each row, sample one random value until all 47 rows were sampled once. Only numerical values can be sampled (NA's should be ignored).
  2. Take these 47 values, compute the mean, and store the mean.
  3. Repeat this process 10,000 times with replacement.
  4. Determine the 95% interval (2.5%-97.5%) of these 10,000 means.
  5. Plot a histogram of the 10,000 means showing the boundaries the 2.5% and 97.5% interval.
  6. Determine whether an observed value falls inside or outside the boundaries.
  7. Compute the P-value of the observed mean.

It's important that only one sample is drawn from each row (randomly) and that every row is sampled once in every iteration.

I hope I'm not asking too much :-) I appreciate any help!


Solution

  • First lets make so data to experiment with

    set.seed(100)
    example <- matrix(runif(47*30), nrow = 47)
    example[sample(length(example), 250)] <- NA
    

    Now we can calculate our means. The apply function samples a random value from each row (!is.na excludes NA values), mean gets the mean, and replicate repeats this 10000 times.

    exmeans <- replicate(10000, mean(apply(example, 1, 
                                     function(x) sample(x[!is.na(x)], 1))))
    

    Confidence intervals can be computed two different ways. The first uses the example means as an empirical distribution and calculates the mean from that, the second uses a normal distribution to calculate the probablilities.

    confint <- quantile(exmeans, c(0.025, 0.975))
    confint <- qnorm(c(0.025, 0.975), mean = mean(exmeans), sd = sd(exmeans))
    

    Next the plots you wanted

    hist(exmeans)
    abline(v = confint, col = "red")
    

    histogram of quantiles

    Finally the p-value information. Once again we can eigther use an empirical distribution or a normal distribution. These provide p-values for the lower tails of the distribution, use 1 - result for upper tails.

    newvalue > confint[1] & newvalue < confint[2]
    ecdf(exmeans)(newvalue)
    pnorm(newvalue, mean = mean(exmeans), sd = sd(exmeans))