Search code examples
pythonperformancedictionaryfilterdictionary-comprehension

Data filtering performance with dict comprehension


While trying to find the most efficient way to filter dictionaries, I stumbled upon a strange behaviour.

I made 4 tests, the first one filters the dictionary sequentially.

The second one does a single filtering with a combination of the rules (which is actually the most efficient way).

Then, I made attempt to make it more generic so that the filter may be used with an arbitrary number of predicates, that could be eventually user-defined, not hard-coded.

And I realised that combining predicated with all was much more inefficient than doing two filterings one after the other.

What can explain this ? Is this the all() function that has poor performance ? Would you suggest any other way to improve performance still in a generic way ?

# TEST 1 (Took 1.717677354812622s)
y = {k:x for k,x in y.items() if x['id'] >= 3 }
y = {k:x for k,x in y.items() if x['name'].find('a') != -1 }


# TEST 2 (Took 1.411365032196045s)
y = {k:x for k,x in y.items() if x['id'] >= 3 and x['name'].find('a') != -1  }

# TEST 3 (Took 3.4738941192626953s)
predicates = [
    lambda x: x['id'] >= 3,
    lambda x: x['name'].find('a') != -1
]
y = {k:x for k,x in y.items() if all([f(x) for f in predicates]) }

# TEST 4 (Took 2.4156315326690674s)
predicates = [
    lambda x: x['id'] >= 3,
]
y = {k:x for k,x in y.items() if predicates[0](x) }
predicates = [
    lambda x: x['name'].find('a') != -1
]
y = {k:x for k,x in y.items() if predicates[0](x) }

Testing bench :

from var_dump import var_dump
import time

start_time = time.time()

for p in range(0,1000000):
    users = {
        1: {'id': 1, 'name': "toto"},
        2: {'id': 2, 'name': "titi"},
        3: {'id': 3, 'name': "tata"},
        4: {'id': 4, 'name': "tutu"},
        5: {'id': 5, 'name': "john"},
        6: {'id': 6, 'name': "jane"}
    }

    y = users

    #-> test goes here

print(y)
print("--- %s seconds ---" % (time.time() - start_time))

Solution

  • all evaluates both conditions, whereas and in TEST_2 is evaluated lazily which means the second condition is evaluated for ids >= 3 only. Illustrating by simple print statements:

    users = {
        1: {'id': 1, 'name': "toto"},
        2: {'id': 2, 'name': "titi"},
        3: {'id': 3, 'name': "tata"},
        4: {'id': 4, 'name': "tutu"},
        5: {'id': 5, 'name': "john"},
        6: {'id': 6, 'name': "jane"}
    }
    
    def check_1(item):
        print("check_1")
        return item['id'] >= 3
    
    def check_2(item):
        print("check_2")
        return item['name'].find('a') != -1
    
    
    # all condition
    print("TEST AND")
    for k, v in users.items():
        check_1(v) and check_2(v)
    
    # all condition
    print("TEST ALL")
    for k, v in users.items():
        all([check_1(v), check_2(v)])
    

    Out:

    TEST AND
    check_1
    check_1
    check_1
    check_1
    check_2
    check_1
    check_2
    check_1
    check_2
    TEST ALL
    check_1
    check_2
    check_1
    check_2
    check_1
    check_2
    check_1
    check_2
    check_1
    check_2
    check_1
    check_2