Search code examples
pythondictionaryoptimizationset-intersectionfrozenset

Using Python, what is the fastest way to compare two large dictionaries while returning keys (as frozensets) that match past a certain threshold?


Assume I have two very large dictionaries: bigDictA and bigDictB, so something like below.

bigDictA = {frozenset("a1","b2","c3"): [some floats in a list], ...}
bigDictB = {frozenset("a1","b2"): [some floats in a list], ...}

Now, the algorithm I wrote which I need help optimizing looks like this:

setOfMatches = set()
for bigDictAkey, bigDictAval in bigDictA.items():
    for bigDictBkey, bigDictBval in bigDictB.items():
        itemsInCommon = list(frozenset.intersection(bigDictAkey,bigDictBkey))
        numberOfItemsInCommon = len(itemsInCommon)
        valForComparison = THRESHOLD*float(len(list(bigDictAkey)))
        if (numberOfItemsInCommon >= valForComparison):
            setOfMatches.add(bigDictAkey)

So if THRESHOLD was 0.3 frozenset("a1","b2","c3") would be added to setOfMatches, but if THRESHOLD was 0.7 it wouldn't be added.

I realize this isn't efficient, but I am definitely open to any suggestions including converting the key datatype to some other datatype for a speed up. I have also looked into using tools like Numba and Cython as well (I prefer to keep it in pure python though). It needs to be crazy fast!

Any help is really appreciated! Thanks!


Solution

    1. You aren't using the values of bigDictA and bigDictB, so you don't need to use the dict.items() method.
    2. len() works on sets too, so no need to convert a set to a list in order to get its size.
    3. As soon as there's one match above the threshold and you add bigDictAkey to setOfMatches, there is no need to test the rest of the items in bigDictB and you should immediately break the inner loop to check the next item in bigDictA.

    The improved code as follows:

    setOfMatches = set()
    for bigDictAkey in bigDictA:
        for bigDictBkey in bigDictB:
            numberOfItemsInCommon = len(bigDictAkey & bigDictBkey)
            valForComparison = THRESHOLD*len(bigDictAkey)
            if (numberOfItemsInCommon >= valForComparison):
                setOfMatches.add(bigDictAkey)
                break