Tags: python · performance · counter · weighted · cumulative-sum

Fastest way to accumulate weighted counts from multiple collections.Counter Objects?


I want to merge multiple collections.Counter objects, where the counts are weighted with a floating point factor before they are added (see code). Each counter object has about 5000 keys and most (but not all) of the keys are shared among all counter objects.

My current code is extremely slow. I use it to merge about 4000 counter objects in sequence with the following function (the surrounding loop is sketched below the code):

def addCounters(c1, c2, factor = 1.0):
    merged = c1
    for word, count in c2.iteritems():
        if word in merged:
            merged[word] = float(merged[word]) + float(count) * factor
        else:
            merged[word] = float(count) * factor
    return merged
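
The calling code is essentially a sequential fold like the sketch below (mergeAll, counters and weights are illustrative names, not my actual variables):

def mergeAll(counters, weights):
    #fold every weighted counter into one running total, one at a time
    total = {}
    for counter, weight in zip(counters, weights):
        total = addCounters(total, counter, weight)
    return total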

According to my cProfile measurements, the 4181 calls take an unacceptably slow 25 seconds. This is very annoying, since it also freezes my GUI.

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
4181    1.056    0.000   25.088    0.006   MyClass.py:30(addCounters)
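
For reference, the profile above was collected with cProfile along these lines (reusing the illustrative mergeAll from above):

import cProfile

#run the whole merge under the profiler and sort the report by cumulative time
cProfile.run('total = mergeAll(counters, weights)', sort='cumulative')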

Does anyone know of a much faster way to do this?


Solution

  • Partially inspired by Mr. F, I came up with something that runs in 4.3 seconds on my local machine. I'd still be very happy if someone can make it even faster. My approach works with both dict and collections.Counter.

    Basically, I "export" the keys into sets and then divide them into 3 groups:

    1. keys unique to c1
    2. keys unique to c2
    3. keys occurring in both.

      • The counts from group 1 are already final, since we compute c1 + factor*c2 and there is no corresponding count in c2
      • Counts from group 2 only need to be multiplied by the factor, since there is no count in c1
      • Keys in group 3 are guaranteed to exist in both c1 and c2, so their counts are simply c1[key] + c2[key]*factor
      • Finally, everything is merged together

    def addCounters(c1, c2, factor = 1.0):
        #make sure factor is float
        factor = float(factor)
        #get keys as sets
        c1keys, c2keys = set(c1.keys()), set(c2.keys())
        #keys only in c1
        c1only = c1keys.difference(c2keys)
        #keys only in c2
        c2only = c2keys.difference(c1keys)
        #keys guaranteed in both
        both = c1keys.intersection(c2keys)

        #generate dicts for keys which are unique to c1 or c2
        c1onlydict = dict((w, c1[w]) for w in c1only)
        c2onlydict = dict((w, c2[w] * factor) for w in c2only)
        #merge them
        loners = dict(c1onlydict, **c2onlydict)

        #now create the dict with keys which exist in both
        bothdict = dict((key, c1[key] + c2[key] * factor) for key in both)

        #merge everything
        return dict(loners, **bothdict)
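
    A quick sanity check with hand-picked toy values (not my real data) confirms the weighted merge behaves as intended:

    from collections import Counter

    c1 = Counter({'a': 2, 'b': 1})
    c2 = Counter({'b': 3, 'c': 4})
    #'a' only in c1 -> stays 2, 'b' in both -> 1 + 3*0.5 = 2.5, 'c' only in c2 -> 4*0.5 = 2.0
    result = addCounters(c1, c2, factor = 0.5)
    assert result == {'a': 2, 'b': 2.5, 'c': 2.0}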
    

    I believe the main reason it runs faster is that, when adding up many counters/dicts, the previous accumulation is always fed back into the method as c1. Therefore c1onlydict (which is the easiest to create and probably has highly optimized bytecode) grows huge, while c2onlydict and bothdict stay tiny by comparison.