Search code examples
rloopsdplyrtidyrends-with

Is there a way to dplyr (tidyverse) map my dataset, find columns ending with the same suffix then keep just one?


Let's say I have two different measurement tools (a3 and a4) developed to assess psychological attributes, such as communication skills. a3 tool is the first version of the instrument and it has 6 items (1,2,3,4,5,6). A4 is the updated version of a3 and it has only certain items developed to replace some items from the a3 version. Therefore, while my dataset always has items 1 to 6 from a3, I can have only items 1, 2 and 5 from a4 version.

As I have these two instruments, I labeled them by the domain_version_item. Therefore, I have a dataset like this one:

> asq_online %>% 
+   names
 [1] "id"           "age_interval" "com_a3_1"     "com_a3_2"     "com_a3_3"    
 [6] "com_a3_4"     "com_a3_5"     "com_a3_6"     "com_a4_1"     "com_a4_2"

If I want to return the a3 version, that's ok:

+   select(starts_with("com_a3")) %>% 
+   names
[1] "com_a3_1" "com_a3_2" "com_a3_3" "com_a3_4" "com_a3_5" "com_a3_6"
> 

If my goal is to return a4 only, that's ok:

> asq_online %>% 
+   select(starts_with("com_a4")) %>% 
+   names
[1] "com_a4_1" "com_a4_2"

So, now I know the a4 version was made to replace the items 1 and 2 from a3 version and I should compose a dataset with com_a4_1, com_a4_2, com_a3_3, com_a3_4, com_a3_5, com_a3_6

I'm struggling to build this syntax. I see I have to compare the items with the same suffix and then replace it to have a print like that:

"com_a4_1" "com_a4_2" "com_a3_3" "com_a3_4" "com_a3_5" "com_a3_6"

That's my dataset

asq_online <- structure(list(id = c(1, 2, 3, 4, 5), age_interval = c(12, 12, 
12, 12, 12), com_a3_1 = c(0, 0, 5, 0, 10), com_a3_2 = c(5, 0, 
10, 0, 5), com_a3_3 = c(10, 10, 10, 0, 5), com_a3_4 = c(5, 0, 
0, 10, 10), com_a3_5 = c(5, 5, 0, 10, 10), com_a3_6 = c(5, 10, 
0, 5, 5), com_a4_1 = c(10, 5, 5, 10, 10), com_a4_2 = c(10, 5, 
0, 0, 10)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", 
"data.frame"))

Please let me know if I'm missing any information. I would like to remain within tidyverse environment. Thank you.


Solution

  • To use tidyverse most effectively, it's typically easier to operate on rows after you convert the data to long format (instead of writing some logic or loops to help pick columns)

    Here is a solution where I first pivot_longer to tidy-long format. Then I group_by id-age_interval-domain-item and then use mutate max and if_else to identify the values with the highest version within the id-age_interval-domain-item groups. Then I filter to get rid of rows that are from outdated versions. Depending on what you are doing next, you could stay in long format, or you could pivot_wider to go back to your original format.

    library(tidyverse)
    
    asq_online <- structure(list(id = c(1, 2, 3, 4, 5), age_interval = c(12, 12, 12, 12, 12), com_a3_1 = c(0, 0, 5, 0, 10), com_a3_2 = c(5, 0, 10, 0, 5), com_a3_3 = c(10, 10, 10, 0, 5), com_a3_4 = c(5, 0, 0, 10, 10), com_a3_5 = c(5, 5, 0, 10, 10), com_a3_6 = c(5, 10, 0, 5, 5), com_a4_1 = c(10, 5, 5, 10, 10), com_a4_2 = c(10, 5, 0, 0, 10)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"))
    
    (asq_online_long <- asq_online %>%
      pivot_longer(cols = -(1:2), names_to = c("domain", "version", "item"), names_sep = "_") %>%
      group_by(id, age_interval, domain, item) %>%
      mutate(latest_version_available = max(version),
             value_from_latest_version = if_else(version == latest_version_available, value, NA_real_)))
    #> # A tibble: 40 x 8
    #> # Groups:   id, age_interval, domain, item [30]
    #>       id age_interval domain version item  value latest_version_~
    #>    <dbl>        <dbl> <chr>  <chr>   <chr> <dbl> <chr>           
    #>  1     1           12 com    a3      1         0 a4              
    #>  2     1           12 com    a3      2         5 a4              
    #>  3     1           12 com    a3      3        10 a3              
    #>  4     1           12 com    a3      4         5 a3              
    #>  5     1           12 com    a3      5         5 a3              
    #>  6     1           12 com    a3      6         5 a3              
    #>  7     1           12 com    a4      1        10 a4              
    #>  8     1           12 com    a4      2        10 a4              
    #>  9     2           12 com    a3      1         0 a4              
    #> 10     2           12 com    a3      2         0 a4              
    #> # ... with 30 more rows, and 1 more variable:
    #> #   value_from_latest_version <dbl>
    
    # if you need it back in original format
    (asq_online_latest_versions_wide <- asq_online_long %>%
      filter(!is.na(value_from_latest_version)) %>%
      select(-latest_version_available, -value, value = value_from_latest_version) %>%
      pivot_wider(names_from = c("domain", "version", "item")))
    #> # A tibble: 5 x 8
    #> # Groups:   id, age_interval [5]
    #>      id age_interval com_a3_3 com_a3_4 com_a3_5 com_a3_6 com_a4_1 com_a4_2
    #>   <dbl>        <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
    #> 1     1           12       10        5        5        5       10       10
    #> 2     2           12       10        0        5       10        5        5
    #> 3     3           12       10        0        0        0        5        0
    #> 4     4           12        0       10       10        5       10        0
    #> 5     5           12        5       10       10        5       10       10
    

    Created on 2019-12-01 by the reprex package (v0.3.0)