I have two data frames: one is 10 x 484 and the other is 10 x 2083.
I would like to know the correlation between each of the 484 columns and each of the 2083 columns, and have the output in a 484 x 2083 matrix. I am trying to use foreach to speed things up:
registerDoParallel(cl <- makeCluster(10, "PSOCK"))
out <- foreach(j=1:ncol(df1), .combine = 'rbind', .packages=c("magrittr", "dplyr")) %:%
  foreach(i = 1:ncol(df2), .combine = 'c') %dopar% {
    a <- cor.test(df1[,j], df2[,i], method = "spearman")$p.value
  }
I get this error:
Error in { : task 1 failed - "'y' must be a numeric vector".
All column values in both data frames are numeric. The code above worked when I used the two smaller practice matrices below. Any pointers in the right direction would be appreciated.
testmatrix1 <- matrix(rexp(1800, rate=.1), ncol=6)
colnames(testmatrix1) <- paste0("testmatrix1.Sample", 1:ncol(testmatrix1))
testmatrix2 <- matrix(rexp(3600, rate=.1), ncol=12)
colnames(testmatrix2) <- paste0("testmatrix2.Sample", 1:ncol(testmatrix2))
Even in your situation, where you're calculating over 1,000,000 correlations, the base correlation function is going to be much faster than the double loop, even if the loop is parallelized. Parallelization is not free: it adds computational overhead, and you're still running a double loop over a million calculations. Your code worked for me on simulated data:
df1 <- as.data.frame(matrix(runif(10*484, -1, 1), ncol=484))
df2 <- as.data.frame(matrix(runif(10*2083, -1, 1), ncol=2083))
library(doParallel)
registerDoParallel(cl <- makeCluster(10, "PSOCK"))
out <- foreach(j=1:ncol(df1), .combine = 'rbind', .packages=c("magrittr", "dplyr")) %:%
  foreach(i = 1:ncol(df2), .combine = 'c') %dopar% {
    a <- cor.test(df1[,j], df2[,i], method = "spearman")$p.value
  }
# release the worker processes once the loop has finished
stopCluster(cl)
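This took just over 2 minutes on my Apple M2 Max MacBook Pro (96GB RAM). If you look at the internals of cor.test(), you can pull out the parts you need and speed up the computation a lot. Here's how you could do it. First, you can replicate the pspearman() function from the internals of cor.test(). You can see all the code for the function by typing stats:::cor.test.default (without parentheses) at the console and pressing Enter. The version below keeps only the t-approximation branch of that function, so, as far as I can tell, its results correspond to cor.test(..., exact = FALSE) rather than to the exact test that cor.test() uses by default for small samples without ties.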
pspearman <- function(q, n, lower.tail = TRUE) {
  den <- (n * (n^2 - 1))/6
  r <- 1 - q/den
  pt(r/sqrt((1 - r^2)/(n - 2)), df = n - 2,
     lower.tail = !lower.tail)
}
Next, we can make the correlations. When x and y are data frames or matrices with the same number of rows, cor(x,y) produces an ncol(x) by ncol(y) matrix of correlations between the columns of x and the columns of y.
R <- cor(df1, df2, method="spearman")
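To see the shape of that result on something small, here is a minimal sketch. The toy data frames x and y and their sizes are assumptions chosen only for illustration.

```r
# Toy data: 10 rows, with 3 and 5 columns (sizes chosen only for illustration)
set.seed(1)
x <- as.data.frame(matrix(rnorm(10 * 3), ncol = 3))
y <- as.data.frame(matrix(rnorm(10 * 5), ncol = 5))

R_small <- cor(x, y, method = "spearman")
dim(R_small)  # rows index the columns of x, columns index the columns of y

# Entry [2, 4] is the Spearman correlation between x[, 2] and y[, 4]
all.equal(R_small[2, 4], cor(x[, 2], y[, 4], method = "spearman"))
```

So row j, column i of the matrix plays the role of the (j, i) iteration of the double loop.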
Next, we can make the pieces that are relevant to calculating the p-value for the test. n is the number of rows in the data frame, and q is an ncol(df1) by ncol(df2) matrix defined by n and R.
n <- nrow(df1)
q <- (n^3 - n) * (1 - R)/6
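If it helps to see what q represents, the sketch below checks it for a single pair of vectors. The vectors a and b are made-up data, and the check assumes there are no ties. In that case q is the sum of squared rank differences, which is the S statistic that cor.test() reports.

```r
# Made-up data for one pair of variables (continuous draws, so no ties)
set.seed(2)
n <- 10
a <- runif(n)
b <- runif(n)

rho <- cor(a, b, method = "spearman")
q_manual <- (n^3 - n) * (1 - rho) / 6

# The same quantity computed directly from the ranks
d2 <- sum((rank(a) - rank(b))^2)

# The S statistic reported by cor.test()
S <- unname(cor.test(a, b, method = "spearman")$statistic)

all.equal(q_manual, d2)
all.equal(q_manual, S)
```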
Next, we can calculate the p-value. For values of q bigger than (n^3 - n)/6 you use the upper tail of the distribution, otherwise you use the lower tail. Multiplying by 2 gives the two-tailed p-value. Because ifelse() and pt() are vectorized, this can be done on the whole matrix at once, which also keeps the ncol(df1) by ncol(df2) layout (looping over rows with apply() would return the transpose).
p <- 2*ifelse(q > (n^3 - n)/6,
              pspearman(q, n, lower.tail = FALSE),
              pspearman(q, n, lower.tail = TRUE))
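Finally, because the one-tailed value is doubled, it is possible for p-values calculated as above to end up above 1. The code below caps those values at 1 (just as cor.test() does). Unlike an if(any(p > 1)) check, pmin() also copes with NA entries, which appear if a column is constant.
p <- pmin(p, 1)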
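As a sanity check, you can compare the matrix approach with cor.test() on a small example. This is a sketch under two assumptions: the toy data frames d1 and d2 are made up, and the comparison uses exact = FALSE so that cor.test() takes the same t-approximation branch. It uses the threshold (n^3 - n)/6, as in stats:::cor.test.default.

```r
# t-approximation branch of the internal pspearman() helper
pspearman <- function(q, n, lower.tail = TRUE) {
  den <- (n * (n^2 - 1)) / 6
  r <- 1 - q / den
  pt(r / sqrt((1 - r^2) / (n - 2)), df = n - 2, lower.tail = !lower.tail)
}

# Small made-up data frames: 10 rows, 4 and 6 columns
set.seed(3)
d1 <- as.data.frame(matrix(runif(10 * 4, -1, 1), ncol = 4))
d2 <- as.data.frame(matrix(runif(10 * 6, -1, 1), ncol = 6))

# Matrix approach: all p-values at once
n <- nrow(d1)
R <- cor(d1, d2, method = "spearman")
q <- (n^3 - n) * (1 - R) / 6
p_fast <- pmin(2 * ifelse(q > (n^3 - n) / 6,
                          pspearman(q, n, lower.tail = FALSE),
                          pspearman(q, n, lower.tail = TRUE)), 1)

# Reference: one cor.test() call per pair, using the asymptotic p-value
p_loop <- sapply(seq_len(ncol(d2)), function(i)
  sapply(seq_len(ncol(d1)), function(j)
    cor.test(d1[, j], d2[, i], method = "spearman", exact = FALSE)$p.value))

all.equal(unname(p_fast), p_loop)
```

If you need the exact p-values that cor.test() gives by default for small samples, the two will not agree, so it is worth deciding which one you want before switching.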
This way, using cor() and calculating the p-values on the entire matrix at once takes about 0.2 seconds, roughly 650 times faster than the double loop (even when it's parallelized).
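As for the original error, the same loop runs on simulated numeric data, so the problem is probably in the real data. This is a guess, since that data isn't shown: "'y' must be a numeric vector" would appear if a column of df2 is not numeric (for example an ID column read in as character), or if df2 is a tibble or data.table, where df2[,i] does not return a plain vector. A minimal sketch of how you might check, using a hypothetical stand-in data frame:

```r
# Hypothetical stand-in for the real data: column b was read in as character
df2_example <- data.frame(a = c(1.5, 2.1, 3.3), b = c("4", "5", "6"))

# Names of the columns that are not numeric
bad <- names(df2_example)[!vapply(df2_example, is.numeric, logical(1))]
bad

# Shows whether the object is a plain data.frame or something else
# (e.g. a tibble or data.table) with different [, i] behaviour
class(df2_example)

# [[i]] extracts a single column as a vector for any of these classes
is.numeric(df2_example[[1]])
```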