I want to merge multiple collections.Counter
objects, where the counts are weighted with a floating point factor before they are added (see code). Each counter object has about 5000 keys and most (but not all) of the keys are shared among all counter objects.
My current code is extremely slow. I use it to merge about 4000 counter objects by merging them in sequence using the following function:
def addCounters(c1, c2, factor = 1.0):
    merged = c1
    for word, count in c2.iteritems():
        if word in merged:
            merged[word] = float(merged[word]) + float(count) * factor
        else:
            merged[word] = float(count) * factor
    return merged
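Whether it actually helps would need measuring, but one variant of the same idea (a sketch, not from the question; Python 3) is to let collections.Counter.update do the additive merge and only pre-scale the incoming counts, since Counter.update adds counts rather than replacing them:

```python
from collections import Counter

def add_counters(c1, c2, factor=1.0):
    # Counter.update adds counts instead of replacing them, so the
    # weighting only has to be applied to c2's values beforehand.
    c1.update({w: c * factor for w, c in c2.items()})
    return c1

total = Counter()
add_counters(total, Counter({"a": 2, "b": 1}), 0.5)
add_counters(total, Counter({"b": 4}), 0.25)
# total == Counter({"a": 1.0, "b": 1.5})
```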
According to my cProfile measurements, 4181 calls take an unacceptably slow 25 seconds. This is very annoying since it also freezes my GUI.
ncalls tottime percall cumtime percall filename:lineno(function)
4181 1.056 0.000 25.088 0.006 MyClass.py:30(addCounters)
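For anyone who wants to reproduce a measurement like this, a self-contained profiling harness along these lines can be used with cProfile (synthetic data, Python 3's .items() instead of iteritems, all names hypothetical):

```python
import cProfile
import random
import string

def addCounters(c1, c2, factor=1.0):
    # the slow merge from the question, with Python 3's .items()
    merged = c1
    for word, count in c2.items():
        if word in merged:
            merged[word] = float(merged[word]) + float(count) * factor
        else:
            merged[word] = float(count) * factor
    return merged

# synthetic workload: 50 counters that all share the same 2000 keys
words = ["".join(random.choices(string.ascii_lowercase, k=8)) for _ in range(2000)]
counters = [{w: 1 for w in words} for _ in range(50)]

def merge_all():
    total = {}
    for c in counters:
        total = addCounters(total, c, 0.5)
    return total

profiler = cProfile.Profile()
profiler.enable()
total = merge_all()
profiler.disable()
profiler.print_stats("cumulative")
```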
Does anyone know of a much faster way to do this?
Partially inspired by Mr. F, I came up with something which runs in 4.3 seconds on my local computer. I'd still be very happy if someone can make it even faster. My approach works with both dict and collections.Counter.

Basically, I "export" the keys into sets and then divide them into 3 groups:

- keys occurring only in c1
- keys occurring only in c2
- keys occurring in both

For keys occurring only in c1 or only in c2, the counts can be taken over directly (scaling c2's by the factor); for keys occurring in both, their counts can be calculated easily again:

def addCounters(c1, c2, factor = 1.0):
    # make sure factor is float
    factor = float(factor)
    # get keys as sets
    c1keys, c2keys = set(c1.keys()), set(c2.keys())
    # keys only in c1
    c1only = c1keys.difference(c2keys)
    # keys only in c2
    c2only = c2keys.difference(c1keys)
    # keys guaranteed in both
    both = c1keys.intersection(c2keys)
    # generate dicts for keys which are unique to c1 or c2
    c1onlydict = dict((w, c1[w]) for w in c1only)
    c2onlydict = dict((w, c2[w] * factor) for w in c2only)
    # merge
    loners = dict(c1onlydict, **c2onlydict)
    # now create the dict with keys which exist in both
    bothdict = dict((key, c1[key] + c2[key] * factor) for key in both)
    # merge everything
    return dict(loners, **bothdict)
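To merge a whole sequence of counters the way described above, the function can be folded over the list, e.g. with functools.reduce. Here is a condensed, self-contained sketch of the same set-based logic (Python 3, names hypothetical):

```python
from functools import reduce

def addCounters(c1, c2, factor=1.0):
    # condensed version of the set-based merge: unique keys are copied
    # over (c2's scaled by factor), shared keys get the weighted sum
    c1keys, c2keys = set(c1), set(c2)
    merged = {w: c1[w] for w in c1keys - c2keys}
    merged.update((w, c2[w] * factor) for w in c2keys - c1keys)
    merged.update((w, c1[w] + c2[w] * factor) for w in c1keys & c2keys)
    return merged

counters = [{"a": 1, "b": 2}, {"b": 3, "c": 1}, {"a": 2, "c": 4}]
# fold the merge over the whole list, starting from an empty dict
total = reduce(lambda acc, c: addCounters(acc, c, 0.5), counters, {})
# total == {"a": 1.5, "b": 2.5, "c": 2.5}
```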
I believe that the main reason why it runs faster is that, when adding multiple counters/dicts, one always feeds the previous accumulation back to the method as c1. Therefore c1onlydict (which is the easiest to create and probably benefits from highly optimized bytecode) becomes huge in size, while c2onlydict and bothdict stay tiny by comparison.
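Following that reasoning, one further possible step (an untested sketch, not part of the original answer) is to mutate the large accumulator in place, so that each merge only touches the small incoming counter's keys instead of rebuilding any dicts:

```python
def add_counters_inplace(acc, c2, factor=1.0):
    # Mutates the (large) accumulator in place: only the keys of the
    # small incoming counter are touched, nothing is copied.
    factor = float(factor)
    get = acc.get  # bound-method lookup hoisted out of the loop
    for word, count in c2.items():
        acc[word] = get(word, 0.0) + count * factor
    return acc

acc = {}
for c in ({"a": 1, "b": 2}, {"b": 3, "c": 1}):
    add_counters_inplace(acc, c, 0.5)
# acc is now {"a": 0.5, "b": 2.5, "c": 0.5}
```

The trade-off is that the accumulator is modified rather than returned fresh, so callers must not reuse the old reference expecting an unchanged dict.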