R: Explode string but keep quoted text as a single word


I came across the question "PHP explode the string, but treat words in quotes as a single word" and similar ones that use a regex to split a sentence on spaces while keeping quoted text intact (as a single word).

I would like to do the same in R. I tried copy-pasting the regular expression into stri_split in the stringi package as well as strsplit in base R, but I suspect the regular expression uses a syntax R does not recognize. The error is:

Error: '\S' is an unrecognized escape in character string...

The desired output would be:

mystr <- '"preceded by itself in quotation marks forms a complete sentence" preceded by itself in quotation marks forms a complete sentence'

myfoo(mystr)

[1] "preceded by itself in quotation marks forms a complete sentence" "preceded" "by" "itself" "in" "quotation" "marks" "forms" "a" "complete" "sentence"

Trying: strsplit(mystr, '/"(?:\\\\.|(?!").)*%22|\\S+/') gives:

Error in strsplit(mystr, "/\"(?:\\\\.|(?!\").)*%22|\\S+/") : 
  invalid regular expression '/"(?:\\.|(?!").)*%22|\S+/', reason 'Invalid regexp'

Solution

  • A simple option would be to use scan, which by default reads quoted text as a single field:

    > x <- scan(what = "", text = mystr)
    Read 11 items
    > x
     [1] "preceded by itself in quotation marks forms a complete sentence"
     [2] "preceded"                                                       
     [3] "by"                                                             
     [4] "itself"                                                         
     [5] "in"                                                             
     [6] "quotation"                                                      
     [7] "marks"                                                          
     [8] "forms"                                                          
     [9] "a"                                                              
    [10] "complete"                                                       
    [11] "sentence"
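
As a follow-up sketch (not part of the original answer), here are two ways to wrap this into the `myfoo` helper the question asks for. The names `myfoo_scan` and `myfoo_regex` are hypothetical. Two assumptions are worth flagging. First, `scan` treats both `'` and `"` as quote characters by default, so the first variant restricts `quote` to double quotes to keep words like "don't" intact. Second, the regex variant extracts tokens with `gregexpr` and `regmatches` rather than splitting with `strsplit`, since the PHP pattern was written to match tokens, not separators. It also drops the PHP `/.../` delimiters, doubles the backslash inside the R string literal (`\\S`), and uses `perl = TRUE`. It does not handle backslash-escaped quotes inside the quoted part.

```r
# Both function names below are hypothetical, chosen for this illustration.

# Variant 1: scan(), restricted to double quotes so apostrophes stay literal.
myfoo_scan <- function(x) {
  scan(text = x, what = "", quote = '"', quiet = TRUE)
}

# Variant 2: extract tokens with a regex instead of splitting on one.
# The pattern matches a double-quoted run or a run of non-whitespace;
# it does not handle backslash-escaped quotes inside the quoted part.
myfoo_regex <- function(x) {
  tokens <- regmatches(x, gregexpr('"[^"]*"|\\S+', x, perl = TRUE))[[1]]
  gsub('^"|"$', "", tokens)  # drop the surrounding quotes
}

mystr <- '"preceded by itself in quotation marks forms a complete sentence" preceded by itself in quotation marks forms a complete sentence'

myfoo_scan(mystr)
myfoo_regex(mystr)
```

Both calls should return the same 11-element character vector as the `scan` output above. The `scan` variant is probably the safer default; the regex variant is mainly useful if the tokenizing rules need to be adjusted later.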