Tags: r, tidyverse, combinations, rcpp

How to calculate normalized ratios in all possible combinations efficiently for a large matrix in R?


I want to calculate normalized ratios, i.e. (x_j - x_i)/(x_j + x_i) for every pair of columns, efficiently for a large matrix in R. I asked a similar question earlier here, and with small data the solutions provided there worked fine. But when I try to apply the same solutions to a large dataset (400 x 2151), my system hangs. My system has 16 GB of RAM and an Intel i7 processor. Here is the code with the data:

df <- matrix(rexp(860400), nrow = 400, ncol = 2151)

Solution provided by @Ronak Shah

cols <- 1:ncol(df)
temp <- expand.grid(cols, cols)
new_data <- (df[,temp[,2]] - df[,temp[,1]])/(df[,temp[,2]] + df[,temp[,1]])
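
For the 400 x 2151 case this hangs because the result alone barely fits in 16 GB of RAM, before even counting the two intermediate matrices of the same size created by the subtraction and the addition:

n_cells <- 400 * 2151^2      # 1,850,720,400 entries in the result
n_cells * 8 / 1024^3         # ~13.8 GiB of 8-byte doubles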

Or the following solution provided by @akrun (written for the data.frame version of this data, where seq_along(df) and names(df) index the columns):

f1 <- function(i, j) (df[, i] - df[, j])/(df[, i] + df[, j])
out <- outer(seq_along(df), seq_along(df), FUN = f1)
colnames(out) <- outer(names(df), names(df), paste, sep = "_")

Both solutions take a very long time and the system hangs. So, how can I do this efficiently?

Update

Update on the expected output, demonstrated on a small fake dataset:

library(tidyverse)

#Fake dataset
df = structure(list(var_1 = c(0.035, 0.047, 0.004, 0.011, 0.01, 0.01, 0.024), 
                    var_2 = c(0.034, 0.047, 0.004, 0.012, 0.01, 0.011, 0.025), 
                    var_3 = c(0.034, 0.047, 0.006, 0.013, 0.011, 0.013, 0.026), 
                    var_4 = c(0.034, 0.046, 0.008, 0.016, 0.014, 0.015, 0.028), 
                    var_5 = c(0.034, 0.046, 0.009, 0.017, 0.015, 0.016, 0.029)), 
               class = "data.frame", row.names = c(NA, -7L))

cols <- 1:ncol(df)
mat_out <- do.call(cbind, lapply(cols, function(xj) 
  sapply(cols, function(xi) (df[, xj] - df[, xi])/(df[, xj] + df[, xi]))))

colnames(mat_out) <-  outer(names(df), names(df), paste, sep = ",")

y <- read.table(text = "s_no    y
1   95.512
2   97.9
3   92.897
4   94.209
5   87.472
6   91.109
7   92.83", header = T)

mat_out %>% as.data.frame() %>% 
  mutate(id = row_number()) %>% 
  left_join(y, by = c("id" = "s_no")) %>% 
  pivot_longer(cols = -c(y, id)) %>% 
  group_by(name) %>% 
  mutate(correl = cor(value, y, use = "complete.obs")) %>% 
  distinct(name, .keep_all = TRUE) %>% 
  separate(name, c("Wav1", "Wav2"), sep = ",") %>% 
  select(-c("id", "y", "value")) %>% 
  pivot_wider(names_from = Wav2, values_from = correl)

#> # A tibble: 5 × 6
#>   Wav1   var_1  var_2  var_3  var_4  var_5
#>   <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#> 1 var_1 NA     -0.190 -0.358 -0.537 -0.551
#> 2 var_2  0.190 NA     -0.322 -0.528 -0.544
#> 3 var_3  0.358  0.322 NA     -0.682 -0.667
#> 4 var_4  0.537  0.528  0.682 NA     -0.595
#> 5 var_5  0.551  0.544  0.667  0.595 NA

Solution

  • You can use Rcpp with RcppArmadillo to make your code faster. Instead of building the full ratio matrix, the function below loops over each column i, forms the normalized ratios against all later columns j > i, and correlates each with y directly (the upper triangle of the matrix in the expected output):

    Rcpp::cppFunction("
      std::vector<double> my_fun(arma::mat& x, arma::vec& y){
        int p = x.n_cols - 1;
        std::vector<double> result;
        for(int i = 0; i < p; i++){
          // normalized ratios of column i against every later column j > i
          arma::mat m = (x.cols(i + 1, p).each_col() - x.col(i));
          m /= (x.cols(i + 1, p).each_col() + x.col(i));
          // correlation of each ratio column with y, appended to the result
          auto a = arma::conv_to<std::vector<double>>::from(arma::cor(m, y));
          result.insert(result.end(), a.begin(), a.end());
        }
        return result;
      }", 'RcppArmadillo')
    
    my_fun(as.matrix(df), y$y) # takes approximately 14 seconds on the full 400 x 2151 matrix
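
    my_fun() returns one flat vector of correlations, in the pair order (1,2), (1,3), ..., (2,3), ... that the loop visits. A minimal sketch for attaching pair labels on the fake data above (the pair_names() helper is illustrative, not part of the original answer):

    # pair_names() is a hypothetical helper: it builds labels in the same
    # (i, j), j > i order in which my_fun() fills its result vector
    pair_names <- function(nms) {
      n <- length(nms)
      unlist(lapply(seq_len(n - 1), function(i)
        paste(nms[i], nms[(i + 1):n], sep = "_")))
    }

    res <- my_fun(as.matrix(df), y$y) # df and y from the Update section
    names(res) <- pair_names(names(df))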
    

    You could use STL functions to make it even faster, though the code will be longer; on my computer that version takes about 6 seconds (a rough sketch of the idea follows the full-size example below). On data of the real size:

    mat <- matrix(rexp(860400), nrow = 400, ncol = 2151)
    y <- rnorm(nrow(mat), 700, 10)
    
    my_fun(mat, y) # this works on the full-size data
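
    The STL version mentioned above was not posted. Below is a minimal sketch of the idea under my own assumptions (my_fun_loops() and everything in it are mine, not the answer's code): compute each pair's ratios and their correlation with y using plain loops and running sums, so the full ratio matrix is never materialised.

    Rcpp::cppFunction('
      Rcpp::NumericVector my_fun_loops(Rcpp::NumericMatrix x, Rcpp::NumericVector y) {
        int n = x.nrow(), p = x.ncol();
        double ym = Rcpp::mean(y);                    // mean of y, computed once
        Rcpp::NumericVector result(p * (p - 1) / 2);  // one slot per pair j > i
        std::vector<double> r(n);
        int out = 0;
        for (int i = 0; i < p - 1; ++i) {
          for (int j = i + 1; j < p; ++j) {
            // normalized ratio of columns j and i, row by row
            for (int t = 0; t < n; ++t)
              r[t] = (x(t, j) - x(t, i)) / (x(t, j) + x(t, i));
            // Pearson correlation of r with y via running sums
            double rm = 0.0;
            for (int t = 0; t < n; ++t) rm += r[t];
            rm /= n;
            double sxy = 0.0, sxx = 0.0, syy = 0.0;
            for (int t = 0; t < n; ++t) {
              double dr = r[t] - rm, dy = y[t] - ym;
              sxy += dr * dy; sxx += dr * dr; syy += dy * dy;
            }
            result[out++] = sxy / std::sqrt(sxx * syy);
          }
        }
        return result;
      }')

    my_fun_loops(mat, y) # same pair order and values as my_fun(mat, y), up to rounding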