Search code examples
rgrepl

Combining dataframe columns with same prefix


I have a dataframe containing the results of multiple Boruta variable selections with environmental variables from different sources as predictors. These predictors are often from different sources (e.g. sources a, b, c), so some have been coded with different suffixes, but representing the same parameter (e.g nitrogen_a, nitrogen_b, nitrogen_c, phosphate_a, phosphate_b etc.).

I need a way to use something like grepl() to identify and group variables that have the same start of the name and collapse them into single variables with the shared variable (e.g. nitrogen, phosphate).

Note, for each row only one variable within a set with a shared variable name prefix contains a non-NA value. So it should be possible to collapse multiple variables into one by simply excluding the NA values. All variables are character vectors.

How might I go about this?


Solution

  • Your results are currently in wide format. The strategy I took was to first make the data long, remove the NA rows, and make the table wide again using updated column names. Here's a tidyverse solution:

    library(tidyverse)
    
    data = tribble(
      ~ id, ~ x_1, ~ x_2, ~ y_1, ~ y_2,
      "a", 1, NA, 100, NA,
      "b", NA, 2, NA, 200
    )
    
    data %>% 
      pivot_longer(-id) %>% 
      filter(!is.na(value)) %>% 
      # Update name so that everything until the _ is kept
      mutate(name = str_extract(name, "[^_]+")) %>% 
      pivot_wider(names_from = "name", values_from = "value")
    

    The data:

    > data
    # A tibble: 2 x 5
      id      x_1   x_2   y_1   y_2
      <chr> <dbl> <dbl> <dbl> <dbl>
    1 a         1    NA   100    NA
    2 b        NA     2    NA   200
    

    The output:

    # A tibble: 2 x 3
      id        x     y
      <chr> <dbl> <dbl>
    1 a         1   100
    2 b         2   200