
tidyr: separate column while retaining delimiter in the first column


I have a column that I am trying to split into two while retaining the delimiter. I got this far, but part of the delimiter is being dropped. I also need to do this split a second time with the delimiter kept at the end of the first column, which I cannot figure out how to do.

library(tidyr)

duplicates <- data.frame(sample = c("a_1_b1", "a1_2_b1", "a1_c_1_b2"))

duplicates <- separate(duplicates, 
                       sample, 
                       into = c("strain", "sample"),
                       sep = "_(?=[:digit:])")

Using only the first name as an example, my output is a_1 and b1, while my desired output is a_1 and _b1.

I would also like to perform this split with the delimiter added to the first column as below.

sample batch
a_1_ b1
a1_2_ b1
a1_c_1_ b2

Edit: This post does not answer my question of how to retain the delimiter, or to control which side of the split it ends up on.


Solution

  • You can use tidyr::extract with capture groups. Placing the underscore inside the first group keeps the delimiter at the end of the first column.

    tidyr::extract(duplicates, sample, c("strain", "sample"), '(.*_)(\\w+)')
    
    #   strain sample
    #1    a_1_     b1
    #2   a1_2_     b1
    #3 a1_c_1_     b2
    

    The same regex can also be used with strcapture in base R:

    strcapture('(.*_)(\\w+)', duplicates$sample, 
               proto = list(strain = character(), sample = character()))
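
The question also asks for the other variant, where the delimiter stays with the second column (a_1 and _b1). A minimal sketch of that, assuming the split should always happen at the last underscore: move the underscore from the first capture group into the second. The pattern `(.*)(_[^_]+)$` and the variable name `split_right` are illustrative choices, not part of the original answer. The example uses base R's strcapture so it runs without extra packages; the same pattern should also work as the regex argument to tidyr::extract.

```r
# Sample data from the question
duplicates <- data.frame(sample = c("a_1_b1", "a1_2_b1", "a1_c_1_b2"))

# Group 1: everything before the last underscore.
# Group 2: the last underscore plus the trailing piece that contains no underscore.
split_right <- strcapture('(.*)(_[^_]+)$', duplicates$sample,
                          proto = list(strain = character(), sample = character()))

split_right
#   strain sample
# 1    a_1    _b1
# 2   a1_2    _b1
# 3 a1_c_1    _b2
```

Which side the delimiter ends up on is decided only by which capture group contains the underscore, so the two variants differ by moving a single character in the pattern.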