Search code examples
rpurrr

Calculate mean-deviated values (subtract mean of all columns except one from this one column)


I have a dataset with the following structure:

df <- data.frame(id = 1:5,
                 study = c("st1","st2","st3","st4","st5"),
                 a_var = c(10,20,30,40,50),
                 b_var = c(6,5,4,3,2),
                 c_var = c(3,4,5,6,7),
                 d_var = c(80,70,60,50,40))

I would like to calculate the difference between each column that has _var in its name and the mean of all other columns containing _var in their names, like this:

mean_deviated_value <- function(data, variable) {
  md_value = data[,variable] - rowMeans(data[,names(data) != variable])
  md_value
  }
    
df$a_var_md <- mean_deviated_value(dplyr::select(df, contains("_var")), "a_var")
df$b_var_md <- mean_deviated_value(dplyr::select(df, contains("_var")), "b_var")
df$c_var_md <- mean_deviated_value(dplyr::select(df, contains("_var")), "c_var")
df$d_var_md <- mean_deviated_value(dplyr::select(df, contains("_var")), "d_var")

Which gives me my desired output:

  id study a_var b_var c_var d_var   a_var_md  b_var_md c_var_md d_var_md
1  1   st1    10     6     3    80 -19.666667 -12.33333    -9.80 83.80000
2  2   st2    20     5     4    70  -6.333333 -16.91667   -10.35 70.76667
3  3   st3    30     4     5    60   7.000000 -21.50000   -10.90 57.73333
4  4   st4    40     3     6    50  20.333333 -26.08333   -11.45 44.70000
5  5   st5    50     2     7    40  33.666667 -30.66667   -12.00 31.66667

How do I do it in one go, without repeating the code, preferably with dplyr/purrr?

I tried this:

df %>%
  mutate(across(contains("_var"), ~ list(md = .x - rowMeans(select(., contains("_var") & !.x)))))

And got this error:

Error: Problem with `mutate()` input `..1`.
ℹ `..1 = across(...)`.
x no applicable method for 'select' applied to an object of class "c('double', 'numeric')"

Solution

  • We can use map_dfc with transmute to create *_md columns, and glue syntax for the names.

    library(tidyverse)
    
    nms <- names(df) %>%
            str_subset('^.*_')
    
    bind_cols(df, map_dfc(nms, ~transmute(df, '{.x}_md' := mean_deviated_value(select(df, contains("_var")), .x))))
    #>   id study a_var b_var c_var d_var   a_var_md  b_var_md  c_var_md d_var_md
    #> 1  1   st1    10     6     3    80 -19.666667 -25.00000 -29.00000 73.66667
    #> 2  2   st2    20     5     4    70  -6.333333 -26.33333 -27.66667 60.33333
    #> 3  3   st3    30     4     5    60   7.000000 -27.66667 -26.33333 47.00000
    #> 4  4   st4    40     3     6    50  20.333333 -29.00000 -25.00000 33.66667
    #> 5  5   st5    50     2     7    40  33.666667 -30.33333 -23.66667 20.33333
    

    Note that if you use assigment. The first time rowMeans will compute with b_var, c_bar and d_bar. But the second time, contains("_var") will also capture the previously created a_var_md and use it to compute the means. I don't know if this is intended behaviour but it is worth mentioning.

    df$a_var_md <- mean_deviated_value(dplyr::select(df, contains("_var")), "a_var")
    select(df, contains("_var"))
    #>   a_var b_var c_var d_var   a_var_md
    #> 1    10     6     3    80 -19.666667
    #> 2    20     5     4    70  -6.333333
    #> 3    30     4     5    60   7.000000
    #> 4    40     3     6    50  20.333333
    #> 5    50     2     7    40  33.666667
    

    We can avoid this by replacing contains("_var") with matches("^.*_var$")

    Created on 2021-12-20 by the reprex package (v2.0.1)