regex r nlp data-cleaning

Trim pattern in a text between \n\n\n\n


I am cleaning text in R. My text has the form

but he could not avoid the subject FULLSTOP \n\n\n\n\nsimilar pieces by the author\n\n\nlife is great 13022015\nreal men don t eath quiche 22042013\nback to the future 01072012\n\n\n\n and as he takes the stage here wednesday night to rally democrats around hillary clinton mr FULLSTOP obama will revisit his own promise to guide the nation into an era of reconciliation and unity harking back to the themes that propelled his improbable rise but that seem even more out of reach today FULLSTOP \n\n\n\n\nobama at convention to lay out stakes for a divided nation \n\n\n\n we get frustrated with political gridlock worry about racial divisions are shocked and saddened by the madness of orlando or nice mr FULLSTOP

I'm trying to get rid of

\n\n\n\n\nsimilar pieces by the author\n\n\nlife is great 13022015\nreal men don t eath quiche 22042013\nback to the future 01072012\n\n\n\n

so to obtain something like

but he could not avoid the subject FULLSTOP and as he takes the stage here wednesday night to rally democrats around hillary clinton mr FULLSTOP obama will revisit his own promise to guide the nation into an era of reconciliation and unity harking back to the themes that propelled his improbable rise but that seem even more out of reach today FULLSTOP \n\n\n\n\nobama at convention to lay out stakes for a divided nation \n\n\n\n we get frustrated with political gridlock worry about racial divisions are shocked and saddened by the madness of orlando or nice mr FULLSTOP

I'm trying with something like

gsub("\\\n{3,}(similar pieces)?.*\\\n{3,}", "", my_string)

or

gsub("\\\n{3,}(similar pieces)?.*?\\\n{3,}", "", my_string)

But each attempt either overtrims or does not work.

Any help (as well as an explanation of what I'm doing wrong and why the alternative works) would be much appreciated.


Solution

  • You need to match everything from the first run of 5 newline symbols up to the next run of 4 newline symbols.

    I suggest a " *\n{5}.*?\n{4} *" regex:

    • " *" - zero or more literal spaces
    • \n{5} - 5 newline symbols
    • .*? - zero or more of any characters, as few as possible, up to the first occurrence of the subpattern that follows
    • \n{4} - 4 newline symbols
    • " *" - zero or more literal spaces (just to trim the match)

    and replace the match with a single space.

    Use sub since you only need 1 replacement:

    sub(" *\n{5}.*?\n{4} *", " ", s)
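
As a rough illustration, the call can be tried on a shortened sample modelled on the text in the question. The sample string and the variable names (cleaned, cleaned_pcre, greedy) are made up for this sketch. It assumes R's default regex engine (TRE), where . also matches a newline; with perl = TRUE the dot does not match a newline by default, so the (?s) flag is added in that variant. The last call drops the optional group from the first attempt in the question to show how a greedy .* runs on to the last run of newlines in the string, which is the likely cause of the overtrimming.

```r
# Shortened, made-up sample modelled on the question's text.
s <- paste0(
  "but he could not avoid the subject FULLSTOP ",
  "\n\n\n\n\nsimilar pieces by the author\n\n\n",
  "life is great 13022015\nback to the future 01072012",
  "\n\n\n\n and as he takes the stage FULLSTOP ",
  "\n\n\n\n\nobama at convention \n\n\n\n we get frustrated"
)

# Default (TRE) engine: `.` also matches newlines, so the lazy `.*?`
# can cross line breaks and stops at the first run of 4 newlines.
cleaned <- sub(" *\n{5}.*?\n{4} *", " ", s)

# PCRE variant: `.` does not match a newline unless (?s) is set.
cleaned_pcre <- sub("(?s) *\n{5}.*?\n{4} *", " ", s, perl = TRUE)

# Greedy `.*` for comparison: it extends to the LAST run of 3+ newlines,
# so the text that should be kept is removed as well.
greedy <- sub("\n{3,}.*\n{3,}", "", s)

cat(cleaned, "\n")
```

Because sub replaces only the first match, the later "obama at convention" block, which is also surrounded by runs of newlines, is left untouched.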
    
