I have this large datset (N = 300.000) and with a power analysis I came to the conclusion that I need only 250 observations to find a correlation if it's present.
So, I want to use a bootstrap to pick 1000 samples of size n = 250 to find the range of p-values in these 1000 samples. I am quite unfamiliar with the bootstrap method, but under here I gave an example on how far I got with the boot package. I used the Iris dataset to illustrate.
My desired output is a histogram showing the frequency distribution of the 1000 obtained p-value values and the 95% confidence interval of possible p-values.
Can anyone help out with my script?
#activate iris datset
library(boot)
library(datasets)
#create function to retrieve p-value
boot.fn <- function(data, sample) {
x <- iris$Petal.Length[i]
y <- iris$Sepal.Length[i]
boot.p <- cor.test(iris$Petal.Length[i],
iris$Sepal.Length[i])$p.value
}
#create 1000 samples with bootstrap function
bootstr <- boot(iris, boot.fn, 1000)
the function boot
will not provide the desired behavior.
However it is quite simple to implement it yourself:
First some data:
x1 <- rnorm(1e5)
y1 <- x1 + rnorm(1e5, 0.5)
cor.test(x1, y1)
#output
Pearson's product-moment correlation
data: x1 and y1
t = 315.97, df = 99998, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.7037121 0.7099151
sample estimates:
cor
0.7068272
sample 250 indexes 1000 times:
#set.seed(1)
z1 <- replicate(1000, sample(1:length(x1), 250, replace = T))
if without replacement is needed just remove that part
now go over the columns, use the indexes to subset x1
and y1
, calculate the statistic and use the unlisted list to plot a histogram.
hist(unlist(apply(z1, 2, function(x){
cor.test(x1[x], y1[x])$p.value
})), xlab = "p value", main = "Uh)
perhaps more informative is:
hist(unlist(apply(z1, 2, function(x){
cor.test(x1[x], y1[x])$estimate
})), xlab = "cor", main ="Uh")