r regex dplyr tidyverse

Break character string into components by matching regex


UPDATED QUESTION

I have this character vector:

str_ <- "H3K9me0S10ph1K14ac1me0"

I would like to break it into pieces such that I get an output like:

"H3K9: me0 | S10: ph1 | K14: ac1,me0"

Preferably this is done in a manner that utilizes {dplyr}, such that I can perform this operation on a tibble and get a new column with the desired character string output. Any ideas?

As the section below suggests, I'm struggling to build a table that records which modifications are paired with what, e.g. that me0 goes with H3K9 and that BOTH ac1 and me0 go with K14.

Any assistance would be so helpful!

Pieces of attempts

Using a slightly different example,

str_ <- "H3K9ac1K14ac1K18ac1me0"

So I've tried breaking the character vector into pieces by extracting all "me[0-9]*" or "ac[0-9]*" etc, then giving them an id which corresponds to their index in the character vector.

# A tibble: 4 x 2
      i m    
  <int> <chr>
1    12 ac1  
2    17 ac1  
3    23 ac1  
4    26 me0 
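
A table of this shape can be built with base R's gregexpr() and regmatches(). This is only a minimal sketch: the pattern (me|ac|ph) is an assumption covering the modification codes seen in the examples, and the positions recorded here are start offsets, so the exact numbers depend on the input string and on whether start or end offsets are used.

```r
str_ <- "H3K9ac1K14ac1K18ac1me0"

# Locate every modification code: a known two-letter prefix plus optional digits.
# The prefix list (me, ac, ph) is an assumption based on the examples only.
hits <- gregexpr("(me|ac|ph)[0-9]*", str_)

mods <- data.frame(
  i = as.integer(hits[[1]]),        # start position of each match in str_
  m = regmatches(str_, hits)[[1]]   # the matched text itself
)
mods
```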

I need a way to create a column, together, that tells whether two modifications belong to the same protein, i.e. in this example K18 has ac1 and me0, so their 'together' values should be TRUE. I've tried using the distance between their indices as a surrogate for togetherness, but I don't think this is the best way to do it:

# A tibble: 4 x 4
      i m     unit_diff together
  <int> <chr>     <int> <lgl>   
1    12 ac1           0 FALSE   
2    17 ac1           5 FALSE   
3    23 ac1           6 TRUE    
4    26 me0           3 TRUE    

Any ideas? I've tried using modulo 3, but this doesn't seem to generalize. Is this even the correct way to be doing this? I'm open to suggestions.


Solution

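  • Use diff to create 'unit_diff' (the gap between consecutive positions, with 0 for the first row), then use %% to flag the rows whose gap is a nonzero multiple of 3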

    library(dplyr)
    df1 %>% 
       mutate(unit_diff = c(0, diff(i)),
        together = unit_diff %% 3 == 0 & unit_diff != 0)
    

    -output

    # A tibble: 4 × 4
          i m     unit_diff together
      <dbl> <chr>     <dbl> <lgl>   
    1    12 ac1           0 FALSE   
    2    17 ac1           5 FALSE   
    3    23 ac1           6 TRUE    
    4    26 me0           3 TRUE    
    

    If we want 'together' to stay TRUE only for runs of exactly n adjacent TRUE values, group by run with rleid from data.table (or rle from base R)

    library(data.table)
    n <- 2
    df1 %>% 
       mutate(unit_diff = c(0, diff(i)),
        together = unit_diff %% 3 == 0 & unit_diff != 0) %>%
       group_by(grp = rleid(together)) %>%
       mutate(together = all(together) &  n() == n) %>%
       ungroup %>%
       select(-grp)
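
The same run-length check can be written with rle from base R alone. The following is a minimal sketch rather than a drop-in replacement for the pipeline above: it assumes the same df1 as in the data section (rebuilt here as a plain data.frame) and uses the hypothetical name run_len in place of n.

```r
# Same data as df1 in the data section, rebuilt as a plain data.frame
df1 <- data.frame(i = c(12, 17, 23, 26), m = c("ac1", "ac1", "ac1", "me0"))
run_len <- 2  # required run length (called n in the answer's code)

# Gap to the previous position, with 0 for the first row
df1$unit_diff <- c(0, diff(df1$i))

# Candidate flag: the gap is a nonzero multiple of 3
flag <- df1$unit_diff %% 3 == 0 & df1$unit_diff != 0

# rle() summarises consecutive equal values as (value, length) pairs
r <- rle(flag)

# Keep TRUE only for runs that are TRUE and exactly run_len long,
# then expand the per-run result back to one value per row
df1$together <- rep(r$values & r$lengths == run_len, times = r$lengths)
df1
```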
    

    For the updated question, we can use regex to insert the delimiters in four steps. First, capture each run of one or more characters that are not lowercase letters (([^a-z]+)) and replace it with the captured group followed by a colon and a space (\\1: ). Next, insert " | " at every position that is preceded by a lowercase letter and a digit and followed by an uppercase letter. Then remove the trailing ": " at the end with trimws. Finally, replace the remaining : that directly follows one or more lowercase letters and one or more digits with a comma

    gsub("([a-z]+\\d+):", "\\1,",
      trimws(gsub("(?<=[a-z][0-9])(?=[A-Z])", " | ", 
     gsub("([^a-z]+)", "\\1: ", str_), perl = TRUE), whitespace = ":\\s+"))
    [1] "H3K9: me0 | S10: ph1 | K14: ac1, me0"
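
Since the question asks for a new column on a tibble, the nested call above can be wrapped in a helper and used inside mutate. This is a minimal sketch: the function name format_mods, the tibble tbl, and the column name parsed are made up for illustration, and the steps are the same four substitutions, just written one per line.

```r
library(dplyr)

# Hypothetical helper: the same four regex steps as above, one per line
format_mods <- function(x) {
  x <- gsub("([^a-z]+)", "\\1: ", x)                            # ": " after each non-lowercase run
  x <- gsub("(?<=[a-z][0-9])(?=[A-Z])", " | ", x, perl = TRUE)  # " | " before each new residue
  x <- trimws(x, whitespace = ":\\s+")                          # drop the trailing ": "
  gsub("([a-z]+\\d+):", "\\1,", x)                              # "ac1: me0" becomes "ac1, me0"
}

# Illustrative tibble holding both example strings from the question
tbl <- tibble(str_ = c("H3K9me0S10ph1K14ac1me0", "H3K9ac1K14ac1K18ac1me0"))

tbl <- tbl %>%
  mutate(parsed = format_mods(str_))
tbl$parsed
```

Every step is vectorised over its input, so the helper works on a whole column without rowwise or sapply.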
    

    data

    df1 <- structure(list(i = c(12, 17, 23, 26), m = c("ac1", "ac1", "ac1", "me0")),
      class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -4L))