Assume I have two very large dictionaries, bigDictA and bigDictB, that look something like this:
bigDictA = {frozenset(["a1", "b2", "c3"]): [some floats in a list], ...}
bigDictB = {frozenset(["a1", "b2"]): [some floats in a list], ...}
Now, the algorithm I wrote which I need help optimizing looks like this:
setOfMatches = set()
for bigDictAkey, bigDictAval in bigDictA.items():
    for bigDictBkey, bigDictBval in bigDictB.items():
        itemsInCommon = list(frozenset.intersection(bigDictAkey, bigDictBkey))
        numberOfItemsInCommon = len(itemsInCommon)
        valForComparison = THRESHOLD * float(len(list(bigDictAkey)))
        if (numberOfItemsInCommon >= valForComparison):
            setOfMatches.add(bigDictAkey)
So if THRESHOLD was 0.3, frozenset(["a1", "b2", "c3"]) would be added to setOfMatches, but if THRESHOLD was 0.7 it wouldn't be.
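To make that concrete, here is the arithmetic with the sample keys above (2 of the 3 items in the A-key are shared):

keyA = frozenset(["a1", "b2", "c3"])
keyB = frozenset(["a1", "b2"])

numberOfItemsInCommon = len(keyA & keyB)         # 2
print(numberOfItemsInCommon >= 0.3 * len(keyA))  # True:  2 >= 0.9
print(numberOfItemsInCommon >= 0.7 * len(keyA))  # False: 2 <  2.1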
I realize this isn't efficient, but I am definitely open to any suggestions, including converting the key datatype to some other datatype for a speedup. I have also looked into tools like Numba and Cython (though I'd prefer to keep it in pure Python). It needs to be crazy fast!
Any help is really appreciated! Thanks!
You only use the keys of bigDictA and bigDictB, so you don't need the dict.items() method. len() works on sets too, so there is no need to convert a set to a list just to get its size. Also, once bigDictAkey has been added to setOfMatches, there is no need to test the rest of the items in bigDictB: you should immediately break out of the inner loop and move on to the next item in bigDictA.

The improved code is as follows:
setOfMatches = set()
for bigDictAkey in bigDictA:
    for bigDictBkey in bigDictB:
        numberOfItemsInCommon = len(bigDictAkey & bigDictBkey)
        valForComparison = THRESHOLD * len(bigDictAkey)
        if (numberOfItemsInCommon >= valForComparison):
            setOfMatches.add(bigDictAkey)
            break
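If the all-pairs double loop is still too slow, a bigger win may come from never comparing an A-key against B-keys it shares nothing with. Below is a minimal sketch of that idea using an inverted index; find_matches and the defaultdict-based index are my own naming choices rather than anything from your code, and the approach assumes each item appears in relatively few B-keys, which is where the speedup comes from:

from collections import defaultdict

def find_matches(bigDictA, bigDictB, threshold):
    # Build an inverted index: item -> set of bigDictB keys containing it.
    index = defaultdict(set)
    for bKey in bigDictB:
        for item in bKey:
            index[item].add(bKey)

    setOfMatches = set()
    for aKey in bigDictA:
        needed = threshold * len(aKey)
        # Count shared items per candidate B-key; only B-keys that share
        # at least one item with aKey ever appear in counts.
        counts = defaultdict(int)
        for item in aKey:
            for bKey in index.get(item, ()):
                counts[bKey] += 1
        if any(count >= needed for count in counts.values()):
            setOfMatches.add(aKey)
    return setOfMatches

Instead of scanning all of bigDictB for every A-key, this only counts overlap for B-keys that actually share at least one item with the current A-key, so on sparse data the inner loop does far less work than the full pairwise comparison.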