mapply over different columns of multiple data


I have a function that takes two vectors and computes a single numeric value (as cor does for a correlation). I have two datasets with the same dimensions, around 6000 columns each, and the function should be applied to each pair of corresponding columns, returning one vector with the resulting values.

The code with a loop would look like this:

set.seed(123)
m=matrix(rnorm(9),ncol=3)
n=matrix(rnorm(9,10),ncol=3)

colNumber=dim(m)[2]
ReturnData=rep(NA,colNumber)

for (i in 1:colNumber){
    ReturnData[i]=cor(m[,i],n[,i])
}

This works fine, but for efficiency reasons I want to use the apply family, presumably the mapply function.

However, mapply(cor, m, n) returns a vector of 9 NAs, whereas it should return:

> ReturnData
[1]  0.1247039 -0.9641188  0.5081204

EDIT/SOLUTION

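The solution, as given by @akrun, was to use data frames instead of matrices.

Furthermore, a speed test of the two proposed solutions showed that they perform about the same; in this run the sapply version was marginally faster than the mapply version (4.16 s vs. 4.33 s elapsed over 100 replications):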

library(rbenchmark)
set.seed(123)
# create the two data frames for the comparison
m = data.frame(matrix(rnorm(10^6), ncol = 100))
n = data.frame(matrix(rnorm(10^6), ncol = 100))
# indx holds the column indices that sapply iterates over
indx = seq_len(ncol(m))
benchmark(s1 = mapply(cor, m, n),
          s2 = sapply(indx, function(i) cor(m[, i], n[, i])),
          order = "elapsed", replications = 100)

#   test replications elapsed relative user.self sys.self user.child sys.child
# 2   s2          100    4.16    1.000      4.15        0         NA        NA
# 1   s1          100    4.33    1.041      4.32        0         NA        NA
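
Before comparing timings, it may be worth confirming that both calls return the same values. The following is only a sketch in base R, using smaller data than the benchmark above so that it runs quickly; the names r_mapply and r_sapply are illustrative and not part of the original code.

```r
set.seed(123)
# Smaller data frames than in the benchmark, same layout (100 columns each)
m <- data.frame(matrix(rnorm(10^4), ncol = 100))
n <- data.frame(matrix(rnorm(10^4), ncol = 100))
indx <- seq_len(ncol(m))

r_mapply <- mapply(cor, m, n)
r_sapply <- sapply(indx, function(i) cor(m[, i], n[, i]))

# mapply keeps the column names of m, sapply over indices does not,
# so drop the names before comparing the values.
all.equal(unname(r_mapply), r_sapply)
```

If the comparison returns TRUE, the two versions differ only in the names attached to the result, so the benchmark compares like with like.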

Solution

  • Because your datasets are matrices, mapply loops over each element instead of each column, and cor on a pair of single values returns NA. To avoid that, convert them to data frames, which mapply treats as lists of columns. I am not sure how efficient this would be for big datasets.

    mapply(cor, as.data.frame(m), as.data.frame(n))
    #     V1         V2         V3 
    #0.1247039 -0.9641188  0.5081204 
    

    Another option is to use sapply over the column indices, without converting to a data.frame:

     indx <- seq_len(ncol(m))
     sapply(indx, function(i) cor(m[,i], n[,i]))
     #[1]  0.1247039 -0.9641188  0.5081204
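
The element-versus-column behaviour described above can be checked directly. This is a minimal sketch assuming the same m and n as in the question; res_mat and res_df are illustrative names only.

```r
set.seed(123)
m <- matrix(rnorm(9), ncol = 3)
n <- matrix(rnorm(9, 10), ncol = 3)

# A matrix is a vector with a dim attribute, so mapply sees 9 elements.
length(m)
# A data frame is a list of columns, so mapply sees 3 columns.
length(as.data.frame(m))

# Element-wise: cor() is called on pairs of single numbers.
res_mat <- suppressWarnings(mapply(cor, m, n))
# Column-wise: cor() is called once per pair of columns.
res_df <- mapply(cor, as.data.frame(m), as.data.frame(n))

length(res_mat)      # one result per matrix element
all(is.na(res_mat))  # every element-wise result is NA
length(res_df)       # one result per column
```

The difference comes from what length() reports for each object, since that is what mapply iterates over.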