regex, r, lookbehind

Extract more than one quotation from a text paragraph with regex in R


First: find the text inside quotation marks, e.g. "I want everything inside here".

Second: extract the one sentence that comes before each quotation.

I would like to get the desired output below with a lookbehind regex in R, if possible.

Example:

Yoyo. He is sad. Oh no! "Don't sad!" Yeah: "Testing...  testings," Boys. Sun. Tree... 0.2% green,"LL" "WADD" HOLA.

Desired Output:

[1] Oh no! "Don't sad!"
[2] Yeah: "Testing... testings"
[3] Tree... 0.2% green, "LL"
[4] Tree... 0.2% green, "LL" "WADD"

dput:

"Yoyo. He is sad. Oh no! \"Don't sad!\" Yeah: \"Testing...  testings,\" Boys. Sun. Tree... 0.2% green,\"LL\" \"WADD\" HOLA."

I tried this, but it doesn't work:

str_extract(t, "(?<=\\.\\s)[^.:]*[.:]\\s*\"[^\"]*\"")

Also tried:

regmatches(t , gregexpr('^[^\\.]+[\\.\\,\\:]\\s+(.*(?:\"[^\"]+\\")).*$', t))

regmatches(t , gregexpr('\"[^\"]*\"(?<=\\s[.?][^\\.\\s])', t))

Tried your method @naurel:

> regmatches(t, regexpr("(?:\"? *([^\"]*))(\"[^\"]*\")", t, perl=T))
[1] " Yoyo. He is sad. Oh no! \"Don't sad!\""

Solution

  • Since you only want the last sentence before the quotation, I've cleaned up the regex for you.

    Explanation: First, you're looking for text that is between quotes. If several quoted strings follow one another, you want them to match as one.

    (\"[^\"]*\"(?: *\"[^\"]*\")*)
    

    This does the trick. Then you want to match the sentence before this group. A sentence starts with a capital letter, so the match begins at the last capital letter before the previously defined group (i.e. a capital that is not followed by any other capital letter before the quotation).

    ([A-Z](?:[a-z0-9\W\s])*)
    

    Put it together and you obtain:

    ([A-Z](?:[a-z0-9\W\s])*)(\"[^\"]*\"(?: *\"[^\"]*\")*)
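
As a rough check of how the pieces above behave, here is a minimal base R sketch. It assumes the example text from the question is stored in a variable named `txt` (a name chosen here to avoid masking R's built-in `t()`), and it doubles the backslashes so the patterns are valid R string literals. `gregexpr()` with `perl = TRUE` is used rather than `regexpr()` so that every match is returned, not only the first.

```r
# Example text from the question; `txt` is a name assumed for this sketch.
txt <- "Yoyo. He is sad. Oh no! \"Don't sad!\" Yeah: \"Testing...  testings,\" Boys. Sun. Tree... 0.2% green,\"LL\" \"WADD\" HOLA."

# Group for one or more consecutive quoted strings.
quote_pat <- "(\"[^\"]*\"(?: *\"[^\"]*\")*)"

# Group for the sentence before it: a capital letter followed by a run of
# lowercase letters, digits, non-word characters and whitespace, so the run
# stops at the next capital letter.
sentence_pat <- "([A-Z](?:[a-z0-9\\W\\s])*)"

# Combined pattern, as in the answer.
full_pat <- paste0(sentence_pat, quote_pat)

# gregexpr() finds all matches; perl = TRUE selects the PCRE engine.
quotes  <- regmatches(txt, gregexpr(quote_pat, txt, perl = TRUE))[[1]]
matches <- regmatches(txt, gregexpr(full_pat, txt, perl = TRUE))[[1]]

writeLines(matches)
# Oh no! "Don't sad!"
# Yeah: "Testing...  testings,"
# Tree... 0.2% green,"LL" "WADD"
```

With this input the combined pattern yields three matches. Consecutive quotations such as "LL" "WADD" come back as a single match, so the separate [3] and [4] lines of the desired output would need an extra step, and the trailing comma inside "Testing...  testings," is kept because it sits inside the quotes.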