Search code examples
pythonstringtextmining

Find pattern in python string with flexible length of each pattern component


I have a string:

str_x = "121001221122010120211122211122222222112222"

I want to find out how many times a given pattern is observed in the string, but the pattern should be seen as flexible:

The pattern I'm looking for is thus:

  • at least three 2's followed by at least two 1's followed by at least three 2's

A pattern satisfying this condition will thus for example be "22211222", but also "2222111222" and "222222221111111111222"

I want to find out how many times this "flexible pattern" is seen in str_x.

The correct answer here is 2 times.

Any ideas how to do this? Thanks a bunch.

EDIT

Given the definition I placed above, the answer of 2 times is actually incorrect, since valid patterns overlap... e.g. "222111222", "2221112222", "22211122222" etc. are all patterns satisfying the objective.

What I want is to find the number of patterns which do not overlap (that is, still 2 times)


Solution

  • You have to use regex to solve your problem: https://docs.python.org/2/library/re.html

    The regular expression:
    regex = r"2{3,}?1{2,}?2{3,}?"
    means = find at least three 2's followed by at least two 1's followed by at least three 2's

    notation 2{3,} means find all at least three 2's
    ? means - greedy search - the search that may overlap
    If you want to find patterns that do not overlap - just remove ?

    import re
    
    regex = r"2{3,}?1{2,}?2{3,}?"
    
    test_str = "121001221122010120211122211122222222112222"
    
    matches = re.finditer(regex, test_str)
    
    for matchNum, match in enumerate(matches):
        matchNum = matchNum + 1
    
        print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    print ("total matches: {matches}".format(matches= matchNum))