Search code examples
rstringtidyverseseparator

Separating strings with a delimiter that only occurs sometimes


I have a colour column that sometimes contains a single colour reference, and sometimes contains multiple colours, which are separated by "||"

library(tidyverse)

id <- c(1:10)
colour <- c("sky-blue","blood-red","lavender-purple",
            "sky-blue||blood-red", "midnight-blue", "blood-red||lavender-purple||sky-blue",
            "grass-green","sky-blue||blood-red||lavender-purple||midnight-blue",
            "grass-green","grass-green||midnight-blue")


df <- tibble("id" = id,
       "colour" = colour)

# A tibble: 10 × 2
      id colour                                             
   <int> <chr>                                              
 1     1 sky-blue                                           
 2     2 blood-red                                          
 3     3 lavender-purple                                    
 4     4 sky-blue||blood-red                                
 5     5 midnight-blue                                      
 6     6 blood-red||lavender-purple||sky-blue               
 7     7 grass-green                                        
 8     8 sky-blue||blood-red||lavender-purple||midnight-blue
 9     9 grass-green                                        
10    10 grass-green||midnight-blue

I would like to separate those colours into individual columns, such that each column only contains one colour, then I would like to stack the colours with duplicate ids. (using gather()) The name of the new colour columns is rather irrelevant, so I went for "col_1", "col_2", etc. , since I will then stack them again. However, if I run separate(), it does the following:

df %>%
  separate(colour, into = c("col_1","col_2","col_3","col_4"), sep = "||")

# A tibble: 10 × 5
      id col_1 col_2 col_3 col_4
   <int> <chr> <chr> <chr> <chr>
 1     1 ""    s     k     y    
 2     2 ""    b     l     o    
 3     3 ""    l     a     v    
 4     4 ""    s     k     y    
 5     5 ""    m     i     d    
 6     6 ""    b     l     o    
 7     7 ""    g     r     a    
 8     8 ""    s     k     y    
 9     9 ""    g     r     a    
10    10 ""    g     r     a  

This also happens if I run it on a single row with the exact right number of columns in the into=

I have looked at some solutions, but haven't found something that covers irregular separator occurrence, and irregular expression length. Any solution would be most welcome.


Solution

  • You need to escape special symbols with \\.

    So try:

    df %>%
      separate_rows(colour, sep = "\\|\\|") %>% 
      mutate(ind = 1) %>% 
      pivot_wider(
        names_from = colour,
        values_from = ind,
        values_fill = 0
      )
    

    Output is:

    # A tibble: 10 × 6
          id `sky-blue` `blood-red` `lavender-purple` `midnight-blue` `grass-green`
       <int>      <dbl>       <dbl>             <dbl>           <dbl>         <dbl>
     1     1          1           0                 0               0             0
     2     2          0           1                 0               0             0
     3     3          0           0                 1               0             0
     4     4          1           1                 0               0             0
     5     5          0           0                 0               1             0
     6     6          1           1                 1               0             0
     7     7          0           0                 0               0             1
     8     8          1           1                 1               1             0
     9     9          0           0                 0               0             1
    10    10          0           0                 0               1             1