Tags: r, dataframe, dplyr, tidyverse

Subtracting vectors by group from two dataframes


I have two dataframes in R. The first contains several feature columns plus a group column (a factor) that says which group each sample (row) belongs to. The second has the same columns, with one row per unique group. From each row of the first dataframe I want to subtract the row of the second dataframe that has the same value in the group column.

Here is an example of the main dataset:

df_repr <- structure(list(f1 = c(-3.9956064225704, 
-0.52380279948658, 0.61089389331505, -3.47273625634875, -4.486918671214, 
-6.1761970731672, -4.62305749757367, -4.42540643005429, -3.61613137597131, 
-3.29821425516253), f2 = c(-1.57918114753228, 
-4.10523012500727, -1.80270009366593, -0.00905317702835884, -0.899585192079915, 
-2.89341515186212, 0.0132542126386332, -3.32639898550135, -0.867793877742314, 
0.0911950321630834), f3 = c(-6.02532301769732, 
-4.90073348094302, -3.73159604513274, -3.55290209472808, -6.63194560195811, 
2.69409789701296, -4.17675978927128, -3.84141885970095, -1.20571283849034, 
1.54287440902102), group = structure(c(1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor")), class = c("tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -10L))

Here is an example dataframe with vectors to be subtracted from each row of the corresponding group of the first dataframe:

to_subtract <- structure(list(group = structure(1:2, .Label = c("A", 
"B"), class = "factor"), f1 = c(-2.78048744402161, 
-2.33583431665818), f2 = c(-2.56086962108741, 
-0.689157827347865), f3 = c(-3.60224982918457, 
-0.782365376308658)), row.names = c(NA, -2L), class = c("tbl_df", 
"tbl", "data.frame"))

# # A tibble: 2 × 4
#   group    f1     f2     f3
#   <fct> <dbl>  <dbl>  <dbl>
# 1 A     -2.78 -2.56  -3.60
# 2 B     -2.34 -0.689 -0.782

I tried to do it like this:

df_repr %>%
  group_by(group) %>%
  mutate(across(where(is.numeric),
         ~ . - to_subtract[to_subtract$group == unique(.$group), -1]))

But I get the following error:

Error in `mutate()`:
ℹ️ In argument: `across(...)`.
ℹ️ In group 1: `group = A`.
Caused by error in `across()`:
! Can't compute column `f1`.
Caused by error in `f1$group`:
! $ operator is invalid for atomic vectors

Expected output for this example:

       f1     f2      f3 group
    <dbl>  <dbl>   <dbl> <fct>
 1 -1.22   0.982 -2.42   A    
 2  2.26  -1.54  -1.30   A    
 3  3.39   0.758 -0.129  A    
 4 -0.692  2.55   0.0493 A    
 5 -1.71   1.66  -3.03   A    
 6 -3.84  -2.20   3.48   B    
 7 -2.29   0.702 -3.39   B    
 8 -2.09  -2.64  -3.06   B    
 9 -1.28  -0.179 -0.423  B    
10 -0.962  0.780  2.33   B 

Solution
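
The attempt in the question fails because inside across() the placeholder . is the current column (an atomic vector such as f1), not the whole dataframe, so .$group is invalid. A minimal sketch of a fix, assuming dplyr 1.0.0 or later, looks up the column name with cur_column() and the group key with cur_group(). The toy tibbles samples and offsets below are hypothetical stand-ins for df_repr and to_subtract:

```r
library(dplyr)

# Hypothetical toy data standing in for df_repr and to_subtract
samples <- tibble(
  f1 = c(1, 2, 3, 4),
  f2 = c(10, 20, 30, 40),
  group = factor(c("A", "A", "B", "B"))
)
offsets <- tibble(
  group = factor(c("A", "B")),
  f1 = c(1, 2),
  f2 = c(5, 10)
)

result <- samples %>%
  group_by(group) %>%
  mutate(across(
    where(is.numeric),
    # .x is the current column within the current group;
    # cur_column() gives its name, cur_group() the group key
    ~ .x - offsets[[cur_column()]][
      as.character(offsets$group) == as.character(cur_group()$group)
    ]
  )) %>%
  ungroup()
```

Comparing the group keys as character avoids an error when the two factors do not share the same set of levels.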

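  • You can use power_left_join() from the powerjoin package with conflict = `-`, which resolves every column the two dataframes share (other than the join key) by subtracting the right-hand value from the left-hand one: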

    library(powerjoin)
    
    power_left_join(df_repr, to_subtract, by = "group", conflict = `-`)
    
    # A tibble: 10 × 4
       group     f1     f2      f3
       <fct>  <dbl>  <dbl>   <dbl>
     1 A     -1.22   0.982 -2.42
     2 A      2.26  -1.54  -1.30  
     3 A      3.39   0.758 -0.129
     4 A     -0.692  2.55   0.0493
     5 A     -1.71   1.66  -3.03
     6 B     -3.84  -2.20   3.48
     7 B     -2.29   0.702 -3.39
     8 B     -2.09  -2.64  -3.06  
     9 B     -1.28  -0.179 -0.423
    10 B     -0.962  0.780  2.33
    

    Another approach uses dplyr::group_modify(), looking up the matching row of to_subtract for each group:

    df_repr %>%
      group_by(group) %>%
      group_modify(~ mutate(.x, across(f1:f3, \(val) {
        val - filter(to_subtract, group == .y$group)[[cur_column()]]
      }))) %>%
      ungroup()
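
If adding a dependency is not an option, the same row-by-group lookup can be sketched in base R with match(). This is only an illustration under stated assumptions: the toy dataframes samples and offsets are hypothetical stand-ins for df_repr and to_subtract, and every group in samples is assumed to appear exactly once in offsets.

```r
# Hypothetical toy data standing in for df_repr and to_subtract
samples <- data.frame(
  f1 = c(1, 2, 3, 4),
  f2 = c(10, 20, 30, 40),
  group = factor(c("A", "A", "B", "B"))
)
offsets <- data.frame(
  group = factor(c("A", "B")),
  f1 = c(1, 2),
  f2 = c(5, 10)
)

# Feature columns are every column of offsets except the key
feature_cols <- setdiff(names(offsets), "group")

# For each row of samples, the index of the offsets row with the same group
idx <- match(samples$group, offsets$group)

# Expand offsets to one row per sample and subtract column by column
result <- samples
result[feature_cols] <- samples[feature_cols] - offsets[idx, feature_cols]
```

Selecting feature_cols by name on both sides keeps the columns aligned even if the two dataframes store them in a different order.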