Search code examples
rresamplingp-value

Bootstrap p-value for correlation coefficient (resampling methods)


I have this large datset (N = 300.000) and with a power analysis I came to the conclusion that I need only 250 observations to find a correlation if it's present.

So, I want to use a bootstrap to pick 1000 samples of size n = 250 to find the range of p-values in these 1000 samples. I am quite unfamiliar with the bootstrap method, but under here I gave an example on how far I got with the boot package. I used the Iris dataset to illustrate.

My desired output is a histogram showing the frequency distribution of the 1000 obtained p-value values and the 95% confidence interval of possible p-values.

Can anyone help out with my script?

#activate iris datset
library(boot)
library(datasets)

#create function to retrieve p-value
boot.fn <- function(data, sample) {
           x <- iris$Petal.Length[i]
           y <- iris$Sepal.Length[i]
           boot.p <- cor.test(iris$Petal.Length[i],
                              iris$Sepal.Length[i])$p.value
           }

#create 1000 samples with bootstrap function
bootstr <- boot(iris, boot.fn, 1000)

Solution

  • the function boot will not provide the desired behavior. However it is quite simple to implement it yourself:

    First some data:

    x1 <- rnorm(1e5)
    y1 <- x1 + rnorm(1e5, 0.5)
    
    cor.test(x1, y1)
    #output
        Pearson's product-moment correlation
    
    data:  x1 and y1
    t = 315.97, df = 99998, p-value < 2.2e-16
    alternative hypothesis: true correlation is not equal to 0
    95 percent confidence interval:
     0.7037121 0.7099151
    sample estimates:
          cor 
    0.7068272 
    

    sample 250 indexes 1000 times:

    #set.seed(1)
    z1 <- replicate(1000, sample(1:length(x1), 250, replace = T))
    

    if without replacement is needed just remove that part

    now go over the columns, use the indexes to subset x1 and y1, calculate the statistic and use the unlisted list to plot a histogram.

    hist(unlist(apply(z1, 2, function(x){
      cor.test(x1[x], y1[x])$p.value
    })), xlab = "p value", main = "Uh)
    

    enter image description here

    perhaps more informative is:

    hist(unlist(apply(z1, 2, function(x){
      cor.test(x1[x], y1[x])$estimate
    })), xlab = "cor", main ="Uh")
    

    enter image description here