Search code examples
algorithmdata-structuresformula

Given a string 's' of length n, calculate the number of occurrences of a character 'c' of s in all substrings of s


Consider the string aba.

All of its substrings are: [a,b,a,ab,ba,aba].

In the above list, character 'a' appears 6 times and b appears 4 times.

Is there any formula for calculating this? that is the number of occurrences of a character 'c' of s in all substrings of s.


Solution

  • In a string of length n, there are n(n+1)/2 substrings.

    The character at index i appears once in every substring, except for the substrings that end at character i-1 or before, and except for the substrings that start at character i+1 or after. Note that here I'm only counting this specific occurrence of the character at index i.

    Let me rephrase that: in a string S c T, where S and T are two substrings and c is a character, this occurrence of character c appears once in every substring of S c T, except in substrings of S and in substrings of T.

    Thus the number of substrings in which appears the character at index i in a string of length n is:

    k_i = n(n+1)/2 - i(i+1)/2 - (n-i-1)(n-i)/2
    

    If we expand, cancel out terms, and refactor:

    k_i = n(n+1)/2 - i(i+1)/2 - (n-i-1)(n-i)/2
    k_i = (n(n+1) - i(i+1) - (n-i-1)(n-i)) / 2
    k_i = (n²+n -i²-i -n²+ni+in-i²+n-i) / 2
    k_i = (2n + 2ni -2i - 2i²) / 2
    k_i = n + ni - i - i²
    k_i = n(i+1) - i(i+1)
    k_i = (i+1) (n-i)
    

    We can sum on all occurrences of this character. In python, this gives the following function:

    def count_occurences_in_substrings(s, c):
        n = len(s)
        return sum(
            (i + 1) * (n - i)
            for i, a in enumerate(s) if a == c
        )
    

    Testing by comparing against a bruteforce function that actually generates the substrings:

    from itertools import combinations
    
    def bruteforce(s, c):
        return ''.join(s[i:j] for i,j in combinations(range(len(s)+1), 2)).count(c)
    
    for s in ['aba', 'abcdefg', 'bcdaefg']:
        k1 = count_occurences_in_substrings(s, 'a')
        k2 = bruteforce(s, 'a')
        print('{:8s}: {:2d},{:2d}'.format(s, k1, k2))
    # aba     :  6, 6
    # abcdefg :  7, 7
    # bcdaefg : 16,16