I have a character vector. For each of its elements I am 100% sure there is a repetition that is always located at the start of the text.
A simplified example of a repeated sentence:
Hello. Hello. How are you?
What I aim for is just: Hello. How are you?
Another example:
Hello I am Joe. Hello I am Joe. How are you?
In this case I would aim for: Hello I am Joe. How are you?
Another example of repetition:
Hello I a Hello I am Joe. How are you?
Another example of repetition:
Hello I am Jo Hello I am Joe. How are you?
In these cases, the desired output is still: Hello I am Joe. How are you?
Another example is the following:
Hello I am J Hello I am Joe. Joe is indeed my name
In this case, the desired output is:
Hello I am Joe. Joe is indeed my name
Notice that all the repetition happens before the desired output, not in the middle and not at the end.
In my data I am sure that each text is at least 440 characters long and that the repeated text at the beginning is of random length, on average 220 characters.
How about this?
library(stringr)
str_remove(string, "(.*)\\s(?=\\1)")
[1] "Hello. How are you?" "Hello I am Joe. Joe is indeed my name" "Hello I am Joe. How are you?"
[4] "Hello I am Joe. How are you?" "Hello I am Joe. How are you?" "Hello I am Joe. Joe is indeed my name"
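If stringr is not available, the same idea can be sketched in base R. This is only a sketch: `cleaned` and `cleaned_anchored` are names introduced here for illustration, and the anchored variant is an assumption based on the statement that the repetition always sits at the very start of the text, not part of the pattern above.

```r
# Sample data, repeated here so the block runs on its own.
string <- c("Hello. Hello. How are you?",
            "Hello I am J Hello I am Joe. Joe is indeed my name",
            "Hello I am Joe. Hello I am Joe. How are you?",
            "Hello I a Hello I am Joe. How are you?",
            "Hello I am Jo Hello I am Joe. How are you?",
            "Hello I am J Hello I am Joe. Joe is indeed my name")

# Same pattern as in the stringr call; perl = TRUE is needed for the lookahead.
# sub() replaces only the first match, like str_remove().
cleaned <- sub("(.*)\\s(?=\\1)", "", string, perl = TRUE)

# Hypothetical stricter variant: ^ anchors the match at the first character
# and (.+) requires a non-empty prefix. This assumes the repetition always
# starts the text.
cleaned_anchored <- sub("^(.+)\\s(?=\\1)", "", string, perl = TRUE)

print(cleaned)
print(identical(cleaned, cleaned_anchored))
```

On the sample data both calls should return the same six strings as the stringr version. Note that `.` does not match a line break by default, so texts containing newlines inside the repeated part would likely need the `(?s)` flag at the start of the pattern.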
How this works:
(.*) : capture group matching any sequence of characters
\\s : one whitespace character
(?=\\1) : positive lookahead asserting that the text captured by the group, and 'remembered' by the backreference \\1, is repeated immediately after the whitespace.

Data (thanks to @giocomai):
string <- c("Hello. Hello. How are you?",
"Hello I am J Hello I am Joe. Joe is indeed my name",
"Hello I am Joe. Hello I am Joe. How are you?",
"Hello I a Hello I am Joe. How are you?",
"Hello I am Jo Hello I am Joe. How are you?",
"Hello I am J Hello I am Joe. Joe is indeed my name")