Search code examples
pythonfor-loopoptimizationtime-complexitypython-itertools

Can this python code be more efficient?


I have written some code to find how many substrings of a string are anagram pairs. The function to find anagram(anagramSolution) is of complexity O(N). The substring function has complexity less than N square. But, this code here is the problem. Can it be more optimized?

for i in range(T):
    x = raw_input()
    alist = get_all_substrings(x)

    for k, j in itertools.combinations(alist,2):
        if(len(k) == len(j)):
            if(anagramSolution(k,j)):
                counter +=1

    counterlist.append(counter)
    counter = 0

The alist can have thousands of items (subsets). The main problem is the loop. It is taking a lot of time to iterate over all the items. Is there any faster or more efficient way to do this?


Solution

  • Define the anagram class of a string to be the set of counts of how many times each letter appears in the string. For example, 'banana' has anagram class a: 3, b: 1, n: 2. Two strings are anagrams of each other if they have the same anagram class. We can count how many substrings of the string are in each anagram class, then compute the number of pairs by computing (n choose 2) for every anagram class with n substrings:

    from collections import Counter
    
    anagram_class_counts = Counter()
    
    for substring in get_all_substrings(x):
        anagram_class_counts[frozenset(Counter(substring).viewitems())] += 1
    
    anagram_pair_count = sum(x*(x-1)/2 for x in anagram_class_counts.viewvalues())
    

    frozenset(Counter(substring).viewitems()) builds a hashable representation of a string's anagram class.

    • Counter takes an iterable and builds a mapping representing how many times each item appeared, so
    • Counter(substring) builds a mapping representing a string's anagram class.
    • viewitems() gives a set-like collection of letter: count pairs, and
    • frozenset turns that into an immutable set that can be used as a dict key.

    These steps together take time proportional to the size of the substring; on average, substrings are about a third of the size of the whole string, so on average, processing each substring takes O(len(x)) time. There are O(len(x)**2) substrings, so processing all substrings takes O(len(x)**3) time.

    If there are x substrings with the same anagram class, they can be paired up in x*(x-1)/2 ways, so the sum goes through the number of occurrences of each anagram class and computes the number of pairs. This takes O(len(x)**2) time, since it has to go through each anagram class once, and there can't be more anagram classes than substrings.

    Overall, this algorithm takes O(len(x)**3) time, which isn't great, but it's a lot better than the original. There's still room to optimize this, such as by computing anagram classes in a way that takes advantage of the overlap between substrings or by using a more efficient anagram class representation.