python awk sed grep

Find repeated strings of more than 6 characters


I am trying to find repeated strings (not words) in a text.

x = 'This is a sample text and this is lowercase text that is repeated.'

In this example, the string ' text ' should not be returned, because the repeated part is only 6 characters long. The string 'his is ', which is 7 characters long, is the expected result.

I tried using range, Counter, and a regular expression:

import re
from collections import Counter

duplist = list()
for i in range(1, 30):
  mylist = re.findall('.{1,'+str(i)+'}', x)
  duplist.append([k for k,v in Counter(mylist).items() if v>1])


Solution

  • You can use the quantifier {7,} to ensure that a match is more than 6 characters long, and a positive lookahead with a backreference to assert that the captured string occurs again later in the text. The re.S flag lets . match newlines as well, so this also works on multi-line input:

    import re
    
    x = 'This is a sample text and this is lowercase text that is repeated.'
    print(re.findall(r'(.{7,})(?=.*\1)', x, re.S))
    

    This outputs:

    ['his is ', 'e text ']
    

    Demo: https://ideone.com/jZvQR5
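
For comparison, here is a minimal sketch of a non-regex alternative that stays close to the range and Counter attempt in the question. It is not part of the answer above. It slides a fixed-size window over the text one character at a time, so overlapping candidates are counted too, and it reports only substrings of exactly the given length. The names `repeated_substrings` and `length` are made up for illustration.

```python
from collections import Counter


def repeated_substrings(text, length=7):
    """Return substrings of exactly `length` characters that occur more than once.

    Hypothetical helper for illustration; matching is case-sensitive.
    """
    # Slide a window of `length` characters across the text, one position
    # at a time, so every possible substring of that length is counted.
    windows = (text[i:i + length] for i in range(len(text) - length + 1))
    counts = Counter(windows)
    # Counter keeps insertion order, so results appear in first-seen order.
    return [s for s, n in counts.items() if n > 1]


x = 'This is a sample text and this is lowercase text that is repeated.'
print(repeated_substrings(x))
```

On the sample text this should give the same two strings as the regex. Unlike the regex, it does not extend a match beyond `length` characters, so a longer repeat would show up as several overlapping windows.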