Search code examples
rstringdataframesplitdelimiter

Splitting a string that is delimited by multiple delimiters (regex)


My data is structured like this:

df <- data.frame(SampleID = c(
  "Akt.B.M_1867_1:100", "Akt.B.M_1869_1:100", "Akt.B.M_1871_1:100",
  "Akt.B.M_1878_1:100", "M.M.K_1259_1:100",   "M.M.K_1262_1:100",
  "M.M.K_1264_1:100",   "M.M.K_1272_1:100",   "SKO.Shard_30_1:100",
  "SKO.Shard_32_1:100", "SKO.Shard_45_1:100", "Zh.San_618_1:100",
  "Zh.San_624_1:100",   "Zh.San_629_1:100"
))

I would like to extract the characters in front of the first period and the characters surrounded by the underline and place them into two subsequent columns. I tried to use separate_wider_delim but there are some entries that have one period and some entries that have two periods. How would I do this in separate_wider_regex or is there a better way without using this function?

Thanks


Solution

  • 1) separate_wider_regex Using the input in the Note at the end use separate_wider_regex as shown:

    library(dplyr)
    library(tidyr)
    
    df %>%
      separate_wider_regex(SampleID, c(A = "[^.]*", "\\..*_", B = ".*", "_.*"), 
        cols_remove = FALSE)
    

    giving

    # A tibble: 14 × 3
       A     B     SampleID          
       <chr> <chr> <chr>             
     1 Akt   1867  Akt.B.M_1867_1:100
     2 Akt   1869  Akt.B.M_1869_1:100
     3 Akt   1871  Akt.B.M_1871_1:100
     4 Akt   1878  Akt.B.M_1878_1:100
     5 M     1259  M.M.K_1259_1:100  
     6 M     1262  M.M.K_1262_1:100  
     7 M     1264  M.M.K_1264_1:100  
     8 M     1272  M.M.K_1272_1:100  
     9 SKO   30    SKO.Shard_30_1:100
    10 SKO   32    SKO.Shard_32_1:100
    11 SKO   45    SKO.Shard_45_1:100
    12 Zh    618   Zh.San_618_1:100  
    13 Zh    624   Zh.San_624_1:100  
    14 Zh    629   Zh.San_629_1:100  
    

    2) sub Using base R remove everything after the first dot giving A and then match everything to the underscore, capture to the next underscore and match the rest keeping the capture.

    transform(df, 
      A = sub("\\..*", "", SampleID),
      B = sub(".*_(.*)_.*", "\\1", SampleID)
    )
    

    Variations of this are possible:

    # 2a - this variation uses the same regex for both A and B
    pat <- "([^.]*).*_(.*)_.*"
    transform(df, 
      A = sub(pat, "\\1", SampleID),
      B = sub(pat, "\\2", SampleID)
    )
    
    # 2b - this variation uses trimws to simplify the regexes needed
    transform(df, 
      A = sub("\\..*", "", SampleID),
      B = trimws(SampleID,, "[^_]") |> trimws(, "_")
    )
    
    # 2c - same but use Reduce to iteratively apply trimws
    trimws2 <- function(x, y) trimws(x,, y)
    transform(df, 
      A = sub("\\..*", "", SampleID),
      B = Reduce(trimws2, init = SampleID, c("[^_]", "_"))
    )
    
    # 2d - use scan to avoid needing regular expressions
    Scan <- function(x, ...) scan(text = x, what = "", quiet = TRUE, ...)
    transform(df,
      A = Scan(SampleID, comment.char = "."),
      B = Scan(SampleID, sep = "_")[c(FALSE, TRUE, FALSE)]
    )
    

    3) separate_wider_delim If we replace the first dot with underscore then we can use separate_wider_delim

    library(dplyr)
    library(tidyr)
    
    df %>%
      mutate(tmp = sub("\\.", "_", SampleID)) %>%
      separate_wider_delim(tmp, names = c("A", NA, "B", NA), delim = "_")
    

    4) strsplit Using Base R use strsplit to split the elements of SampleID by dot and underscore and then take the first and second last of each such vector.

    L <- strsplit(df$SampleID, "[._]")
    transform(df,
      A = sapply(L, head, 1),
      B = sapply(L, \(x) tail(x, 2)[1])
    )
    

    5) strcapture With Base R use strapture to extract the matches to the capture groups in the regular expression. pat is from (2).

    transform(df, part = strcapture(pat, SampleID, list(A = "", B = "")))
    

    Note

    df <- data.frame(SampleID = c(
        "Akt.B.M_1867_1:100", "Akt.B.M_1869_1:100", "Akt.B.M_1871_1:100",
        "Akt.B.M_1878_1:100", "M.M.K_1259_1:100", "M.M.K_1262_1:100",
        "M.M.K_1264_1:100", "M.M.K_1272_1:100", "SKO.Shard_30_1:100",
        "SKO.Shard_32_1:100", "SKO.Shard_45_1:100", "Zh.San_618_1:100",
        "Zh.San_624_1:100", "Zh.San_629_1:100"
      )
    )