I have a dataframe containing the results of multiple Boruta variable selections with environmental variables from different sources as predictors. These predictors are often from different sources (e.g. sources a, b, c), so some have been coded with different suffixes, but representing the same parameter (e.g nitrogen_a, nitrogen_b, nitrogen_c, phosphate_a, phosphate_b etc.).
I need a way to use something like grepl() to identify and group variables that have the same start of the name and collapse them into single variables with the shared variable (e.g. nitrogen, phosphate).
Note, for each row only one variable within a set with a shared variable name prefix contains a non-NA value. So it should be possible to collapse multiple variables into one by simply excluding the NA values. All variables are character vectors.
How might I go about this?
Your results are currently in wide format. The strategy I took was to first make the data long, remove the NA
rows, and make the table wide again using updated column names. Here's a tidyverse solution:
library(tidyverse)
data = tribble(
~ id, ~ x_1, ~ x_2, ~ y_1, ~ y_2,
"a", 1, NA, 100, NA,
"b", NA, 2, NA, 200
)
data %>%
pivot_longer(-id) %>%
filter(!is.na(value)) %>%
# Update name so that everything until the _ is kept
mutate(name = str_extract(name, "[^_]+")) %>%
pivot_wider(names_from = "name", values_from = "value")
The data:
> data
# A tibble: 2 x 5
id x_1 x_2 y_1 y_2
<chr> <dbl> <dbl> <dbl> <dbl>
1 a 1 NA 100 NA
2 b NA 2 NA 200
The output:
# A tibble: 2 x 3
id x y
<chr> <dbl> <dbl>
1 a 1 100
2 b 2 200