Search code examples
rsummarizespread

Group by, summarize, spread in R not working


I have a data frame that looks like the following:

  ID  Code  Desc
  1   0A    Red
  1   NA    Red
  2   1A    Blue
  3   2B    Green

I want to first create a new column where I concatenate the values in the Code column where the IDs are the same. So:

  ID  Combined_Code  Desc
  1    0A | NA       Red
  2    1A            Blue
  3    2B            Green

Then I want to take the original Code column and spread it. The values in this case would be a count of how many times each Code shows up for a given ID. So:

  ID  Combined_Code 0A  NA  1A  2B  Desc
  1    0A | NA      1   1   0   0   Red
  2    1A           0   0   1   0   Blue
  3    2B           0   0   0   1   Green

I've tried:

sample_data %>%
 group_by(ID) %>%
 summarise(Combined_Code = paste(unique(Combined_Code), collapse ='|'))

This works for creating the concatenation. However, I can't get this to work in tandem with spread:

 sample_data %>%
  group_by(ID) %>%
  summarise(Combined_Code = paste(unique(Combined_Code), collapse ='|'))

sample_data <- spread(count(sample_data, ID, Combined_Code, Desc., Code), Code, n, fill = 0)

Doing this spreads, but drops the concatenation. I've also tried this with filter instead of summarise, which gives the same result. This results in:

 ID  Combined_Code 0A  NA  1A  2B  Desc
  1    0A          1   0   0   0   Red
  1    NA          0   1   0   0   Red
  2    1A          0   0   1   0   Blue
  3    2B          0   0   0   1   Green

Finally, I've tried piping spread through the summarise function:

sample_data %>%
  group_by(ID) %>%
  summarise(Combined_Code = paste(unique(Combined_Code), collapse ='|')) %>%
  spread(count(sample_data, ID, Combined_Code, Desc., Code), Code, n, fill = 0)

This results in the error:

Error: `var` must evaluate to a single number or a column name, not a list
Run `rlang::last_error()` to see where the error occurred.

What can I do to solve these problems?


Solution

  • We can do a group by paste

    library(dplyr)
    library(stringr)
    df1 %>%
       group_by(ID, Desc) %>%
       summarise(Combined_Code = str_c(Code, collapse="|"))
    # A tibble: 3 x 3
    # Groups:   ID [3]
    #     ID Desc  Combined_Code
    #  <int> <chr> <chr>        
    #1     1 Red   0A|0B        
    #2     2 Blue  1A           
    #3     3 Green 2B     
    

    For the second case, after creating a 'val' column of 1s, paste the 'Code' elements afte grouping by 'ID', 'Desc', then use pivot_wider from tidyr to reshape from 'long' to 'wide format.

    library(tidyr)
    df1 %>% 
       mutate(val = 1) %>%
       group_by(ID, Desc) %>% 
       mutate(Combined_Code = str_c(Code, collapse="|")) %>% 
       pivot_wider(names_from = Code, values_from = val, values_fill = list(val = 0))
    # A tibble: 3 x 7
    # Groups:   ID, Desc [3]
    #    ID Desc  Combined_Code  `0A`  `0B`  `1A`  `2B`
    #  <int> <chr> <chr>         <dbl> <dbl> <dbl> <dbl>
    #1     1 Red   0A|0B             1     1     0     0
    #2     2 Blue  1A                0     0     1     0
    #3     3 Green 2B                0     0     0     1
    

    The OP's expected output is

      ID  Combined_Code 0A  0B  1A  2B  Desc
      1    0A | 0B      1   1   0   0   Red
      2    1A           0   0   1   0   Blue
      3    2B           0   0   0   1   Green
    

    Update

    For the updated dataset, there are NA elements in the 'Code', and by default str_c returns NA if there any NA as one of the elements, while paste still returns the NA along with the other elements. Here, we replace the str_c with paste

    df2 %>% 
        mutate(val = 1) %>%
        group_by(ID, Desc) %>% 
        mutate(Combined_Code = paste(Code, collapse="|")) %>% 
        pivot_wider(names_from = Code, values_from = val, values_fill = list(val = 0))
    # A tibble: 3 x 7
    # Groups:   ID, Desc [3]
    #     ID Desc  Combined_Code  `0A`  `NA`  `1A`  `2B`
    #  <int> <chr> <chr>         <dbl> <dbl> <dbl> <dbl>
    #1     1 Red   0A|NA             1     1     0     0
    #2     2 Blue  1A                0     0     1     0
    #3     3 Green 2B                0     0     0     1
    

    data

    df1 <- structure(list(ID = c(1L, 1L, 2L, 3L), Code = c("0A", "0B", "1A", 
    "2B"), Desc = c("Red", "Red", "Blue", "Green")), 
    class = "data.frame", row.names = c(NA, 
    -4L))
    
    
    
    df2 <- structure(list(ID = c(1L, 1L, 2L, 3L), Code = c("0A", NA, "1A", 
    "2B"), Desc = c("Red", "Red", "Blue", "Green")), class = "data.frame",
    row.names = c(NA, 
    -4L))