Splitting a comma- and colon-delimited string in R


I'm trying to split a string that contains two entries. Each entry has a specific format:

  • Category (e.g. active site/region) which is followed by a :
  • Term (e.g. His, Glu/nucleotide-binding motif A) which is followed by a ,

Here's the string that I want to split:

string <- "active site: His, Glu,region: nucleotide-binding motif A,"

This is what I have tried so far (using stringr). Apart from the two empty substrings, it produces the desired output.

library(stringr)

unlist(str_extract_all(string, ".*?(?=,(?:\\w+|$))"))

[1] "active site: His, Glu"              ""                                   "region: nucleotide-binding motif A"
[4] "" 

How do I get rid of the empty substrings?


Solution

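  • You get the empty strings because .*? can also match an empty string at any position where the lookahead (?=,(?:\\w+|$)) succeeds, i.e. directly before a comma that is followed by a word character or by the end of the string.

    To avoid this, require at least one character other than a colon or comma (using a negated character class) before matching the colon: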

    [^:,\n]+:.*?(?=,(?:\w|$))
    

    Explanation

    • [^:,\n]+ Match 1+ chars other than : , or a newline
    • : Match the colon
    • .*? Match any character except a newline, as few times as possible
    • (?= Positive lookahead, assert that what is directly to the right of the current position is:
      • , Match literally
      • (?:\w|$) Match either a single word char, or assert the end of the string
    • ) Close the lookahead
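
The explanation above can be mirrored in code by assembling the pattern from its parts. This is only a sketch: the variable names `pattern` and `entries` are illustrative, and it uses base R (`gregexpr()` with `perl = TRUE` plus `regmatches()`) instead of stringr so that it runs without extra packages.

```r
# Assemble the pattern piece by piece; backslashes are doubled because
# this is an R string literal.
pattern <- paste0(
  "[^:,\\n]+",       # category: 1+ chars other than ':', ',' or a newline
  ":",               # the colon that follows the category
  ".*?",             # the terms: any chars, as few as possible
  "(?=,(?:\\w|$))"   # stop before a ',' followed by a word char or the end
)

string <- "active site: His, Glu,region: nucleotide-binding motif A,"

# Base R counterpart of str_extract_all(); perl = TRUE enables the lookahead.
entries <- regmatches(string, gregexpr(pattern, string, perl = TRUE))[[1]]
print(entries)
```

Because the pattern must consume at least one non-colon, non-comma character and a colon, it cannot produce a zero-length match.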

    library(stringr)

    string <- "active site: His, Glu,region: nucleotide-binding motif A,"
    unlist(str_extract_all(string, "[^:,\\n]+:.*?(?=,(?:\\w|$))"))
    

    Output

    [1] "active site: His, Glu"              "region: nucleotide-binding motif A"
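
If you would rather keep the original pattern, another option is to drop the empty strings after extraction. A minimal sketch, assuming the result vector is exactly the one shown in the question (the names `res` and `kept` are illustrative):

```r
# Result of the original attempt, copied from the question's output.
res <- c("active site: His, Glu", "", "region: nucleotide-binding motif A", "")

# nzchar() is TRUE for non-empty strings, so this keeps only the real entries.
kept <- res[nzchar(res)]
print(kept)
```

This leaves the regex untouched, but it only hides the zero-length matches rather than preventing them, so adjusting the pattern is usually the cleaner fix.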