I want to calculate normalised ratios in all possible combinations efficiently for a large matrix in R. I have asked a similar question earlier here and with a small data and the solutions provided there worked fine. But when I am trying to apply the same solution for a large dataset (400 x 2151), my system is getting hang. My system is having 16 GB RAM with Intel i7 processer. Here is the code with data
df <- matrix(rexp(860400), nrow = 400, ncol = 2151)
Solution provided by @Ronak Shah
cols <- 1:ncol(df)
temp <- expand.grid(cols, cols)
new_data <- (df[,temp[,2]] - df[,temp[,1]])/(df[,temp[,2]] + df[,temp[,1]])
Or the following solution as provided by @akrun
f1 <- function(i, j) (df[, i] - df[, j])/(df[, i] + df[, j])
out <- outer(seq_along(df), seq_along(df), FUN = f1)
colnames(out) <- outer(names(df), names(df), paste, sep = "_")
Both the solutions taking a very long time and the system is getting hang. So, how can I efficiently do it?
Update
Upadate on expected output
library(tidyverse)
#Fake dataset
df = structure(list(var_1 = c(0.035, 0.047, 0.004, 0.011, 0.01, 0.01, 0.024),
var_2 = c(0.034, 0.047, 0.004, 0.012, 0.01, 0.011, 0.025),
var_3 = c(0.034, 0.047, 0.006, 0.013, 0.011, 0.013, 0.026),
var_4 = c(0.034, 0.046, 0.008, 0.016, 0.014, 0.015, 0.028),
var_5 = c(0.034, 0.046, 0.009, 0.017, 0.015, 0.016, 0.029)),
class = "data.frame", row.names = c(NA, -7L))
cols <- 1:ncol(df)
mat_out <- do.call(cbind, lapply(cols, function(xj)
sapply(cols, function(xi) (df[, xj] - df[, xi])/(df[, xj] + df[, xi]))))
colnames(mat_out) <- outer(names(df), names(df), paste, sep = ",")
y <- read.table(text = "s_no y
1 95.512
2 97.9
3 92.897
4 94.209
5 87.472
6 91.109
7 92.83", header = T)
mat_out %>% as.data.frame() %>%
mutate(id = row_number()) %>%
left_join(y, by = c("id" = "s_no")) %>%
pivot_longer(cols = -c(y, id)) %>%
group_by(name) %>%
mutate(correl = cor(value, y, use = "complete.obs")) %>%
distinct(name, .keep_all = TRUE) %>%
separate(name, c("Wav1", "Wav2"), sep = ",") %>%
select(-c("id", "y", "value")) %>%
pivot_wider(names_from = Wav2, values_from = correl)
#> # A tibble: 5 × 6
#> Wav1 var_1 var_2 var_3 var_4 var_5
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 var_1 NA -0.190 -0.358 -0.537 -0.551
#> 2 var_2 0.190 NA -0.322 -0.528 -0.544
#> 3 var_3 0.358 0.322 NA -0.682 -0.667
#> 4 var_4 0.537 0.528 0.682 NA -0.595
#> 5 var_5 0.551 0.544 0.667 0.595 NA
You can use rcpp
to make your code faster:
Rcpp::cppFunction("
std::vector<double> my_fun(arma::mat& x, arma::vec& y){
int p = x.n_cols - 1;
std::vector<double> result;
for(int i = 0; i<p; i++){
auto m = (x.cols(i+1, p).each_col() - x.col(i));
m /= (x.cols(i+1, p).each_col() + x.col(i));
auto a = arma::conv_to<std::vector<double>>::from(arma::cor(m, y));
result.insert(result.end(), a.begin(), a.end());
}
return result;
}", 'RcppArmadillo')
my_fun(df, y) # takes approximately 14seconds.
You could use the STL functions to make it even faster. Though the code will be longer. on my computer this takes 6seconds
mat <- matrix(rexp(860400), nrow = 400, ncol = 2151)
y <- rnorm(nrow(df), 700, 10)
my_fun(mat, y) # This works