
Regex remove duplicate phrases in multiline string


What is the problem:

I have a multiline text, for example:

1: This is test string for my app. d
2: This is test string for my app.
3: This is test string for my app. abcd
4: This is test string for my app.
5: This is test string for my app.
6: This is test string for my app.
7: This is test string for my app. d
8: This is test string for my app.
9: This is test string for my app.
10: This is another string.

The line numbers are only here for readability; they are not part of the text itself.

What I have tried:

I have tried two different regexes (the flags are always i, g and m):

^([^\r\n]*)$(.*?)(?:(?:\r?\n|\r)\1)+$

see here: regexr.com/5nklg

and

^(.*)(?:\r?\n|\r)(?=[\s\S]*^\1$)

see here: regexr.com/5nkla

The two produce different outputs; both are good, but neither is perfect.

What I would like to achieve:

Remove all duplicate phrases in the text, but keep one. In the example above, that means keeping the first "This is test string for my app." from line 1, matching the same phrase on lines 2 - 9, and keeping line 10.

It would also work for me if I could keep the last matching phrase instead of the first. In that case the regex would match lines 1 - 8 and keep lines 9 and 10.

Is there a way to do this with Regex?

FYI: I will later use the regex in Python to sub the duplicates out:

re.sub(r"^(.*)(?:\r?\n|\r)(?=[\s\S]*^\1$)", "", my_text, flags=re.MULTILINE)
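For reference, here is a minimal runnable sketch of that call on the sample text above. It assumes the text uses plain "\n" line endings and has no trailing whitespace. The variable names (my_text, whole_line_result) are only illustrative.

```python
import re

# Sample text from the question, without the illustrative line numbers.
my_text = "\n".join([
    "This is test string for my app. d",
    "This is test string for my app.",
    "This is test string for my app. abcd",
    "This is test string for my app.",
    "This is test string for my app.",
    "This is test string for my app.",
    "This is test string for my app. d",
    "This is test string for my app.",
    "This is test string for my app.",
    "This is another string.",
])

# Second pattern from above: remove a line (and its line ending) whenever
# an identical whole line occurs again further down in the text.
pattern = r"^(.*)(?:\r?\n|\r)(?=[\s\S]*^\1$)"
whole_line_result = re.sub(pattern, "", my_text, flags=re.MULTILINE)

print(whole_line_result)
# This is test string for my app. abcd
# This is test string for my app. d
# This is test string for my app.
# This is another string.
```

This pattern only removes lines that are repeated in full, keeping the last copy of each, so lines that share the phrase but end differently ("... d", "... abcd") all survive.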

EDIT: A 'phrase' means, let's say, 3 or more words, so the regex should match any duplication that is longer than 2 words. The expected output after the first sub would be:

This is test string for my app. d  //from line 1
This is test string for my app.    //from line 2
abcd                               //from line 3
This is another string.            //from line 10

Thanks in advance!


Solution

  • You can use

    re.sub(r'^(([^\n\r.]*).*)(?:(?:\r?\n|\r)\2.*)*', r'\1', my_text, flags=re.M)
    

    See the regex demo.

    Details:

    • ^ - start of a line (since the re.M option is used, ^ now matches line start positions)
    • (([^\n\r.]*).*) - Group 1: the whole line, made up of Group 2 (zero or more chars other than a dot, CR and LF, i.e. the text before the first dot) followed by the rest of the line
    • (?:(?:\r?\n|\r)\2.*)* - zero or more sequences of
      • (?:\r?\n|\r) - a CRLF, CR or LF line ending
      • \2 - same text as in Group 2
      • .* - the rest of the line.

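    The replacement is the Group 1 value, so each run of consecutive lines that start with the same Group 2 text is replaced by the first line of that run.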
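The details above can be tried out with a short sketch. It assumes the question's sample text with plain "\n" line endings. The helper name collapse_prefix_runs is hypothetical and not part of the original answer.

```python
import re

# Sample text from the question, without the illustrative line numbers.
sample = "\n".join([
    "This is test string for my app. d",
    "This is test string for my app.",
    "This is test string for my app. abcd",
    "This is test string for my app.",
    "This is test string for my app.",
    "This is test string for my app.",
    "This is test string for my app. d",
    "This is test string for my app.",
    "This is test string for my app.",
    "This is another string.",
])


def collapse_prefix_runs(text):
    """Keep the first line of each run of consecutive lines that begin
    with the same text before the first dot (hypothetical helper)."""
    return re.sub(
        r'^(([^\n\r.]*).*)(?:(?:\r?\n|\r)\2.*)*',  # pattern from the answer
        r'\1',                                     # keep only Group 1
        text,
        flags=re.M,
    )


collapsed = collapse_prefix_runs(sample)
print(collapsed)
# This is test string for my app. d
# This is another string.
```

Two things are worth keeping in mind with this approach. It only merges consecutive lines, and it treats the text before the first dot as the 'phrase'. If Group 2 is empty (a blank line, or a line that starts with a dot), \2 matches anywhere, so all following lines are absorbed into that match. It is probably safest to strip or handle such lines first.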