python performance dictionary vectorization

Extracting elements from a long list of dictionaries efficiently


I have a (long) list of dictionaries, but for the sake of this example I represent them as

d = [{'a':1}, {'a':2}, {'a':3}]

I need to extract the same element from each of these dictionaries, i.e.,

[i['a'] for i in d]

What is the most efficient way to do this in Python? List comprehensions and for-loops work, but they are not known for being especially fast. Can the process be vectorized somehow?


Additional details: The dictionaries have multiple keys, but it is the same one that I need to extract. All the dictionaries have the same keys.


Solution

  • Use pandas. You pay the upfront cost of importing pandas and creating the data frame, but the subsequent operations are vectorized and efficient:

    import pandas as pd
    
    d = [{'a':1, 'b':11}, {'a':2, 'b':12}, {'a':3, 'b':13}]
    df = pd.DataFrame(d)
    print(df['a'])
    print(list(df['a']))
    

    Prints:

    0    1
    1    2
    2    3
    Name: a, dtype: int64
    
    [1, 2, 3]
    

    Benchmarking:

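    The result is that, for medium-size datasets, reading the column from a pandas data frame is slightly faster than a list comprehension over the list of dictionaries (about 4.2 s versus 4.5 s for 100 runs over 1 million rows), not including the cost of creating the data frame.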

    The benchmarking code is based on the answer by JL Peyret. Unlike that answer, I place the data frame initialization (not just import pandas) outside the timed code, so the benchmark measures only access to the elements of the data structure. The data structure also has more rows (1 million). I assume this is a realistic alternative scenario for medium-size datasets.

    import random
    import pandas as pd
    import timeit
    
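    def pandas_dict(datain):
        # Ignores datain on purpose: reads the prebuilt global df,
        # so data frame construction is not part of the timing.
        return list(df["a"])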
    
    def list_comp(datain):
        return [v["a"] for v in datain]
    
    nrows = 1000000
    
    rand = list(range(nrows))
    random.shuffle(rand)
    
    data = [{"a":rand[i], "b":rand[i]} for i in range(nrows)]
    
    df = pd.DataFrame(data)
    print(df)
    
    results = []
    
    for totest in [list_comp, pandas_dict]:
        print(totest.__name__)
        print(timeit.timeit(stmt='totest(data)', number=100, globals=globals()))
        res = totest(data)
        results.append(res)
        print(f"{res[0:10]=}, {len(res)=}")
    

    Results:

                 a       b
    0       847669  847669
    1       777701  777701
    2       446229  446229
    3       984577  984577
    4       813383  813383
    ...        ...     ...
    999995  636811  636811
    999996  413271  413271
    999997  346275  346275
    999998  414864  414864
    999999  381832  381832
    
    [1000000 rows x 2 columns]
    list_comp
    4.460098167066462
    res[0:10]=[847669, 777701, 446229, 984577, 813383, 705699, 7830, 466819, 485673, 400344], len(res)=1000000
    pandas_dict
    4.201689457986504
    res[0:10]=[847669, 777701, 446229, 984577, 813383, 705699, 7830, 466819, 485673, 400344], len(res)=1000000
    

    See also:

    Pandas DataFrame performance compared to dictionary
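
If the values are needed only once, the data frame construction cost that the benchmark above leaves out may matter more than the access time. In that case a pure-Python alternative worth timing is operator.itemgetter combined with map. The following is a minimal sketch, not a definitive benchmark: the function names extract_listcomp and extract_itemgetter and the 100,000-row test input are my own assumptions for illustration, and the timings will vary by machine and Python version.

```python
from operator import itemgetter
import timeit

# Small stand-in for the long list of dictionaries from the question.
d = [{'a': 1, 'b': 11}, {'a': 2, 'b': 12}, {'a': 3, 'b': 13}]

def extract_listcomp(rows, key):
    """Baseline from the question: a plain list comprehension."""
    return [row[key] for row in rows]

def extract_itemgetter(rows, key):
    """Alternative: map() applies itemgetter(key) to each dictionary."""
    return list(map(itemgetter(key), rows))

values_listcomp = extract_listcomp(d, 'a')
values_itemgetter = extract_itemgetter(d, 'a')
print(values_listcomp)    # [1, 2, 3]
print(values_itemgetter)  # [1, 2, 3]

# Time both on a larger (hypothetical) input; absolute numbers depend on the machine.
rows = [{'a': i, 'b': i} for i in range(100_000)]
for func in (extract_listcomp, extract_itemgetter):
    elapsed = timeit.timeit(lambda: func(rows, 'a'), number=10)
    print(f"{func.__name__}: {elapsed:.3f} s")
```

Both functions return the same list, so whichever is faster on your data can be swapped in without changing the calling code.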