Search code examples
rdplyr

How can you reference non-selected columns within across()?


I have a dataframe with integer columns x1, x2, x3, ... and bool columns x1_status, x2_status, x3_status, ...

df = tibble(
    x1 = 1:5,
    x1_status = c(T, T, T, F, F),
    x2 = 6:10,
    x2_status = c(F, T, T, T, F),
    x3 = 11:15,
    x3_status = c(F, F, T, T, T)
)

How do I calculate the mean of x1, x2, x3, ... at the values where x1_status is True, x2_status is True, x3_status is True, and so on, respectively? I'd like to use across() within summarize() so I don't have to list each x column individually.

That is, I'm hoping to get a dataframe with columns mean_x1 = 2, mean_x2 = 8, and mean_x3 = 14.

I was thinking of something like

df |>
    summarize(
        across(
            !matches("_"),
            \(value_col) {
                status_col = str_c(cur_column(), "_status")
                mean(value_col[.data[[status_col]]])
            },
            .names = "mean_{col}"
        )
    )

but that doesn't work. I'd also like to avoid using df[[status_col]], so I can include this in a pipeline that might create the x1, x2, x3, ... columns without having to assign df in the interim.


Solution

  • Using pivot_longer() might be more intuitive:

    library(tidyverse)
    
    df = tibble(
      x1 = 1:5,
      x1_status = c(T, T, T, F, F),
      x2 = 6:10,
      x2_status = c(F, T, T, T, F),
      x3 = 11:15,
      x3_status = c(F, F, T, T, T)
    )
    
    # Pivot to longer dataset
    df_long <- df %>%
      rename_with(
        ~ paste0(., "_value"),
        c(x1, x2, x3)
      ) %>%
      pivot_longer(
        cols = everything(),
        names_sep = "_",
        names_to = c("name", ".value")
      )
    
    # Summarize
    long_summary <- df_long %>%
      filter(status) %>%
      group_by(name) %>%
      summarize(value = mean(value))
    
    # Pivot to wide summary
    wide_summary <- long_summary %>%
      pivot_wider(
        names_from = "name",
        values_from = "value",
        names_prefix = "mean_"
      )
    
    print(wide_summary)