
How to remove repeated sentences with stringi?


I have a character vector. For each of its elements, I am 100% sure there is a repetition that is always located at the start of the text.

A simplified example of a repeated sentence:

Hello. Hello. How are you?

What I aim for is just Hello. How are you?

Another example:

Hello I am Joe. Hello I am Joe. How are you?

In this case I would aim for: Hello I am Joe. How are you?

Another example of repetition:

Hello I a Hello I am Joe. How are you?

Another example of repetition:

Hello I am Jo Hello I am Joe. How are you?

In these cases, the desired output is still: Hello I am Joe. How are you?

Another example is the following:

Hello I am J Hello I am Joe. Joe is indeed my name

In this case, the desired output is:

Hello I am Joe. Joe is indeed my name

Notice that all the repetition happens before the desired output, not in the middle and not at the end.

In my data, I am sure that each text is at least 440 characters long and that the repeated text at the beginning is of random length, on average 220 characters.


Solution

  • How about this?

    library(stringr)
    str_remove(string, "(.*)\\s(?=\\1)")
    [1] "Hello. How are you?"                   "Hello I am Joe. Joe is indeed my name" "Hello I am Joe. How are you?"         
    [4] "Hello I am Joe. How are you?"          "Hello I am Joe. How are you?"          "Hello I am Joe. Joe is indeed my name"
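Since the question asks about stringi specifically, here is a rough sketch of what the same idea might look like with `stri_replace_first_regex()`. The leading `^` is an addition that is not part of the pattern above; it assumes, as the question states, that the repetition always sits at the very start of the text.

```r
library(stringi)

# Same example data as used in this answer
string <- c("Hello. Hello. How are you?",
            "Hello I am J Hello I am Joe. Joe is indeed my name",
            "Hello I am Joe. Hello I am Joe. How are you?",
            "Hello I a Hello I am Joe. How are you?",
            "Hello I am Jo Hello I am Joe. How are you?",
            "Hello I am J Hello I am Joe. Joe is indeed my name")

# "^" (an assumption added here) pins the match to the start of each string
cleaned <- stri_replace_first_regex(string, "^(.*)\\s(?=\\1)", "")

# Why anchor: without "^", a string that has no repeated prefix loses its
# first whitespace, because an empty capture group trivially "repeats"
unanchored <- stri_replace_first_regex("Hello there", "(.*)\\s(?=\\1)", "")
anchored   <- stri_replace_first_regex("Hello there", "^(.*)\\s(?=\\1)", "")
unanchored  # "Hellothere"
anchored    # "Hello there"
```

On the six example strings, the anchored and unanchored patterns give the same result; the difference only shows up for texts without a repeated prefix.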
    

    How this works:

    • (.*): capture group matching any sequence of characters; it is greedy, so the longest possible prefix is tried first
    • \\s: one whitespace character
    • (?=\\1): positive lookahead asserting that the text captured by the group, 'remembered' by the backreference \\1, occurs again immediately after that whitespace
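To make the three parts listed above more concrete, this small sketch inspects what the pattern actually matches on one of the example strings. The variable names `x`, `pattern`, `m` and `result` are only illustrative.

```r
library(stringr)

x <- "Hello I am Jo Hello I am Joe. How are you?"
pattern <- "(.*)\\s(?=\\1)"

# str_match() returns a matrix: column 1 is the full match,
# column 2 is the text captured by the group (.*)
m <- str_match(x, pattern)
m[1, 1]  # "Hello I am Jo "  (captured prefix plus the whitespace)
m[1, 2]  # "Hello I am Jo"   (the captured prefix only)

# The lookahead consumes nothing, so only the prefix and the whitespace
# are removed and the second copy of the text stays in place
result <- str_remove(x, pattern)
result   # "Hello I am Joe. How are you?"
```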

    Data (thanks to @giocomai):

    string <- c("Hello. Hello. How are you?", 
                "Hello I am J Hello I am Joe. Joe is indeed my name",
                "Hello I am Joe. Hello I am Joe. How are you?",
                "Hello I a Hello I am Joe. How are you?",
                "Hello I am Jo Hello I am Joe. How are you?",
                "Hello I am J Hello I am Joe. Joe is indeed my name")