I'm trying to split a string containing two entries, where each entry has a specific format:
a label (e.g. active site / region), which is followed by a :
a value (e.g. His, Glu / nucleotide-binding motif A), which is followed by a ,
Here's the string that I want to split:
string <- "active site: His, Glu,region: nucleotide-binding motif A,"
This is what I have tried so far. Except for the two empty substrings, it produces the desired output.
unlist(str_extract_all(string, ".*?(?=,(?:\\w+|$))"))
[1] "active site: His, Glu" "" "region: nucleotide-binding motif A"
[4] ""
How do I get rid of the empty substrings?
You get the empty strings because .*? can also match an empty string at any position where the assertion (?=,(?:\\w+|$)) is true.
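A small sketch of that empty match, using base R's regexpr with perl = TRUE (PCRE) rather than stringr, on the assumption that both engines treat this pattern the same way. The name rest is only illustrative: it holds the text left over once the first 21-character entry has been consumed.

```r
string <- "active site: His, Glu,region: nucleotide-binding motif A,"
pattern <- ".*?(?=,(?:\\w+|$))"

# Text remaining after the first entry; it starts with the comma before "region"
rest <- substring(string, 22)

# Find the first match in the remaining text
m <- regexpr(pattern, rest, perl = TRUE)

as.integer(m)            # match starts at position 1, right at the comma
attr(m, "match.length")  # length 0: the lazy .*? matched nothing
```

The lookahead is already satisfied at the comma, so the lazy quantifier stops immediately and the match is an empty string.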
To avoid that, start the match with a negated character class that requires one or more characters other than a colon or comma, followed by the colon:
[^:,\n]+:.*?(?=,(?:\w|$))
Explanation
[^:,\n]+   Match 1+ chars other than a colon, a comma or a newline
:   Match the colon
.*?   Match any char, as few times as possible
(?=   Positive lookahead, assert that directly to the right of the current position is:
  ,   Match a comma literally
  (?:\w|$)   Match either a single word char, or assert the end of the string
)   Close the lookahead
library(stringr)
string <- "active site: His, Glu,region: nucleotide-binding motif A,"
unlist(str_extract_all(string, "[^:,\\n]+:.*?(?=,(?:\\w|$))"))
Output
[1] "active site: His, Glu" "region: nucleotide-binding motif A"
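If stringr is not available, the same pattern should also work with base R. This is a sketch, assuming PCRE (perl = TRUE) behaves like stringr's engine for this pattern; the variable names entries and original are only illustrative. The last two lines show a fallback that keeps the pattern from the question and simply drops the empty strings with nzchar.

```r
string <- "active site: His, Glu,region: nucleotide-binding motif A,"

# Same pattern as above; perl = TRUE is required for the lookahead
pattern <- "[^:,\\n]+:.*?(?=,(?:\\w|$))"
entries <- regmatches(string, gregexpr(pattern, string, perl = TRUE))[[1]]
entries

# Fallback: take the result of the original pattern and drop the empty strings
original <- c("active site: His, Glu", "", "region: nucleotide-binding motif A", "")
original[nzchar(original)]
```

Fixing the pattern is the more robust choice, since filtering afterwards would also discard entries that are legitimately empty.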