Tags: r, regex, string, separator

Separate string by start and finish strings


In R I have a series of strings like:

"New:\r\nRemote_UI: Apple CarPlay application cannot be started (P3_DA18018395_012) (91735)\r\nMedia: After an iPhone is authorised as BTA device for the first time, Entertainment volume is abruptly set to zero when the user picks a song from "Current tracklist" (DA18018395_015)\r\n\r\nKnown:\r\nHWR in navigation entry is not read out (89412)"

I would like to get something like:

New:
[1] Remote_UI: Apple CarPlay application cannot be started (P3_DA18018395_012) (91735)
[2] Media: After an iPhone is authorised as BTA device for the first time, Entertainment volume is abruptly set to zero when the user picks a song from "Current tracklist" (DA18018395_015)

Known:
[1] HWR in navigation entry is not read out (89412)

Note that a string may contain only "New", only "Known", neither of them, or both in either order. Any ideas? Thanks!


Solution

  • You may use

    x <- "New:\r\nRemote_UI: Apple CarPlay application cannot be started (P3_DA18018395_012) (91735)\r\nMedia: After an iPhone is authorised as BTA device for the first time, Entertainment volume is abruptly set to zero when the user picks a song from \"Current tracklist\" (DA18018395_015)\r\n\r\nKnown:\r\nHWR in navigation entry is not read out (89412)"
    New <- regmatches(x, gregexpr("(?:\\G(?!\\A)\\R+|New:\\R+)\\K.+(?!\\R+\\w+:\\R)", x, perl=TRUE))
    Known <- regmatches(x, gregexpr("(?:\\G(?!\\A)\\R+|Known:\\R+)\\K.+(?!\\R+\\w+:\\R)", x, perl=TRUE))
    


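    Output of printing New and then Known (note that the first item keeps a trailing \r and the second loses its closing parenthesis, because the lookahead forces .+ to backtrack):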

    [[1]]
    [1] "Remote_UI: Apple CarPlay application cannot be started (P3_DA18018395_012) (91735)\r"                                                                                                     
    [2] "Media: After an iPhone is authorised as BTA device for the first time, Entertainment volume is abruptly set to zero when the user picks a song from \"Current tracklist\" (DA18018395_015"
    
    [[1]]
    [1] "HWR in navigation entry is not read out (89412)"
    

    The regex used is

    (?:\G(?!\A)\R+|New:\R+)\K.+(?!\R+\w+:\R)
    

    The second regex differs from this one only in the literal word: Known replaces New.
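
Since the two patterns differ only in the header word, they could be generated from one template. The following is a minimal sketch, not part of the original answer: `section_pattern` and `extract_section` are hypothetical helper names, the sample string is shortened for illustration, and the header is assumed to contain no regex metacharacters. `trimws()` is added to strip the trailing `\r` that `.` picks up on Windows-style line endings.

```r
# Hypothetical helper: builds the answer's pattern for any header word.
# Assumes `header` contains no regex metacharacters.
section_pattern <- function(header) {
  sprintf("(?:\\G(?!\\A)\\R+|%s:\\R+)\\K.+(?!\\R+\\w+:\\R)", header)
}

# Hypothetical helper: returns the items matched for one section.
extract_section <- function(x, header) {
  m <- regmatches(x, gregexpr(section_pattern(header), x, perl = TRUE))[[1]]
  trimws(m)  # remove leading/trailing whitespace, including a trailing "\r"
}

# Shortened sample input (illustrative, not the asker's data)
x <- "New:\r\nfirst item (1)\r\nsecond item (2)\r\n\r\nKnown:\r\nthird item (3)"

known <- extract_section(x, "Known")    # items under "Known:"
missing <- extract_section(x, "Fixed")  # header absent: empty character vector
```

A header that does not occur in the string yields `character(0)`, which covers the "only New" and "only Known" cases from the question. The template inherits the backtracking behaviour of the original pattern, so the last item before another header may still lose its final character.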

    Details

    • (?:\G(?!\A)\R+|New:\R+) - the end of the previous match and 1+ line breaks (\G(?!\A)\R+) or (|) New: and then 1 or more line breaks (\R+)
    • \K - match reset operator discarding the whole text matched so far
    • .+ - 1+ chars other than a line feed, as many as possible (here . still matches \r, as the trailing \r in the first result shows)
    • (?!\R+\w+:\R) - a negative lookahead that fails the match if, immediately to the right of the current location, there are:
      • \R+ - 1+ line breaks,
      • \w+ - 1+ word chars
      • : - a colon
      • \R - a line break.
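
As an alternative to the lookahead-based pattern detailed above, the string could be split into lines first and each line assigned to the most recent header. This is a sketch under stated assumptions rather than the answer's method: `split_sections` is a hypothetical name, a header is assumed to be a line consisting only of word characters followed by a colon, and blank lines are treated purely as separators.

```r
# Hypothetical alternative: split into lines, then group lines by header.
split_sections <- function(x) {
  lines <- strsplit(x, "\r?\n")[[1]]       # handles both \r\n and \n
  lines <- lines[nzchar(lines)]            # drop blank separator lines
  is_header <- grepl("^\\w+:$", lines)     # e.g. "New:" or "Known:"
  if (!any(is_header)) return(list())      # no sections at all
  group <- cumsum(is_header)               # index of the header each line follows
  keep <- !is_header & group > 0           # ignore text before the first header
  headers <- sub(":$", "", lines[is_header])
  # Keep sections in their order of appearance
  split(lines[keep], factor(headers[group[keep]], levels = unique(headers)))
}

# Shortened sample inputs (illustrative, not the asker's data)
res <- split_sections("New:\r\nA: first\r\nB: second\r\n\r\nKnown:\r\nC: third")
rev_res <- split_sections("Known:\nC: third\nNew:\nA: first")
none <- split_sections("no headers here")
```

The result is a named list with one character vector per section, so `res$New` and `res$Known` hold the items. Because no character class has to exclude `\r` and no lookahead is involved, items keep their full text, and any header order or missing header is handled by the same code path.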