Search code examples
pythonsubstringfrequency

Counting the frequency of a repeating character substring in Python? Consider max repetition of the character not the min as seen in count( )


The desired function calls and outputs of required_function() are expected like followings:

>>> print(required_function("1112211000022", "1"))
0
>>> print(required_function("1112211000022", "11"))
1
>>> print(required_function("1112211000022", "111"))
1 
>>> print(required_function("1112211000022", "0"))
0
>>> print(required_function("1112211000022", "00"))
0
>>> print(required_function("1112211000022", "0000"))
1
>>> print(required_function("1112211000022", "22"))
2

Let's try python's count() function but it extracts non overlapping counts using minimum matching length of the search substring.

"1112211000022".count('11') #returns 2 but desired output is 1 because "11" is present only once.
"1112211000022".count('1') #returns 5 but desired output is 0 because "1" is not present.
"1112211000022".count('0') #returns 4 but desired output is 0 because "0" is not present.

Goal: This function can be helpful in counting occurrence of an amino acid within a protein sequence. I am also trying to write an efficient code, will share soon as I get it done. The function should be time efficient as there will be around 50,000 protein sequences of an average length 250 in a single file related to a single Specie.


Solution

  • If you are going to be doing this kind of analysis, you should become well-versed in regular expressions. They are a frequently-used tool for analyzing sequences like this. You current problem is handled easily by setting up a regex using negative lookahead/lookbehind that looks for the target that doesn't follow and isn't followed by the character in the pattern. One way to do it is:

    import re
    
    def required_function(s, t):
        c = t[0]
        rx = re.compile(f'(?<!{c}){t}(?!{c})')
        return len(rx.findall(s))
    
    assert(0 == required_function("1112211000022", "1"))
    assert(1 == required_function("1112211000022", "11"))
    assert(1 == required_function("1112211000022", "111"))
    assert(0 == required_function("1112211000022", "0"))
    assert(0 == required_function("1112211000022", "00"))
    assert(1 == required_function("1112211000022", "0000"))
    assert(2 == required_function("1112211000022", "22"))
    

    This will churn through 50,000 examples in a few milliseconds and could be improved by wisely re-using the compiled regex whenever possible.