I have a dataframe with integer columns x1, x2, x3, ... and bool columns x1_status, x2_status, x3_status, ...
df = tibble(
x1 = 1:5,
x1_status = c(T, T, T, F, F),
x2 = 6:10,
x2_status = c(F, T, T, T, F),
x3 = 11:15,
x3_status = c(F, F, T, T, T)
)
How do I calculate the mean of x1, x2, x3, ... at the values where x1_status is True
, x2_status is True
, x3_status is True
, and so on, respectively? I'd like to use across()
within summarize()
so I don't have to list each x column individually.
That is, I'm hoping to get a dataframe with columns mean_x1 = 2
, mean_x2 = 8
, and mean_x3 = 14
.
I was thinking of something like
df |>
summarize(
across(
!matches("_"),
\(value_col) {
status_col = str_c(cur_column(), "_status")
mean(value_col[.data[[status_col]]])
},
.names = "mean_{col}"
)
)
but that doesn't work. I'd also like to avoid using df[[status_col]]
, so I can include this in a pipeline that might create the x1, x2, x3, ... columns without having to assign df in the interim.
Using pivot_longer()
might be more intuitive:
library(tidyverse)
df = tibble(
x1 = 1:5,
x1_status = c(T, T, T, F, F),
x2 = 6:10,
x2_status = c(F, T, T, T, F),
x3 = 11:15,
x3_status = c(F, F, T, T, T)
)
# Pivot to longer dataset
df_long <- df %>%
rename_with(
~ paste0(., "_value"),
c(x1, x2, x3)
) %>%
pivot_longer(
cols = everything(),
names_sep = "_",
names_to = c("name", ".value")
)
# Summarize
long_summary <- df_long %>%
filter(status) %>%
group_by(name) %>%
summarize(value = mean(value))
# Pivot to wide summary
wide_summary <- long_summary %>%
pivot_wider(
names_from = "name",
values_from = "value",
names_prefix = "mean_"
)
print(wide_summary)