Search code examples
pythondictionarycountercountingdefaultdict

How to reverse index a Counter and convert it into a defaultdict? Python


Given a Counter, e.g.:

>>> from collections import Counter
>>> Counter('123112415121361273')
Counter({'1': 7, '2': 4, '3': 3, '5': 1, '4': 1, '7': 1, '6': 1})

How can I reverse the index and get the counts as keys and the values as lists of original string keys?

The purpose is to convert the example above into something like this:

defaultdict(<type 'list'>, {1: ['5', '4', '7', '6'], 3: ['3'], 4: ['2'], 7: ['1']})

I have tried tried manually reiterating through the Counter:

>>> from collections import Counter
>>> Counter('123112415121361273')
Counter({'1': 7, '2': 4, '3': 3, '5': 1, '4': 1, '7': 1, '6': 1})
>>> x = Counter('123112415121361273')
>>> from collections import Counter, defaultdict
>>> y = defaultdict(list)
>>> for s, count in x.items():
...     y[count].append(s)
... 
>>> y
defaultdict(<type 'list'>, {1: ['5', '4', '7', '6'], 3: ['3'], 4: ['2'], 7: ['1']})

But is there any other way to do this?

Since the input is the string '123112415121361273' and the output is supposed to be the dictionary indexed by counts, is there any way to avoid the counting step when iterating through it the first time and get to the resulting defaultdict?


Solution

  • No, there is no more efficient way.

    Counting is best done with a mapping, which is exactly what Counter does. Since you don't know the final count for any character until you fully traversed the string, you can't know up-front what bucket to file a character into until you have completed counting.

    So the in-efficient alternative is to start with a mapping from count to characters, then move characters up to the next bucket as you find that they already have a count. Finding that they already have a count requires that you test against every bucket, so that becomes a O(NK) solution as opposed to your straightforward linear O(N) solution that the Counter gives you.

    ## Warning: this is not an efficient approach; use for illustration purposes only
    
    from collections import defaultdict
    
    s = '123112415121361273'
    count_to_char = defaultdict(set)  # use a set to avoid O(N**2) performance
    max_count = 0
    for char in s:  # loop over N items
        for i in range(1, max_count + 1):  # loop over up to K buckets
            if char in count_to_char[i]:
                count_to_char[i].remove(char)
                count_to_char[i + 1].add(char)
                break
        else:
            i = 0
            count_to_char[1].add(char)
        max_count = max(i + 1, max_count)
    # remove empty buckets again
    for count in [c for c, b in count_to_char.items() if not b]:
        del count_to_char[count]
    # alternative method to clear empty buckets, producing a regular dict
    # count_to_char = {c: b for c, b in count_to_char.items() if b}
    

    The way to avoid that scan over K buckets is to use a counter, which you already use.