python performance dictionary vectorization

Extracting elements from a long list of dictionaries efficiently

I have a (long) list of dictionaries, but for the sake of this example I represent them as

d = [{'a':1}, {'a':2}, {'a':3}]

I need to extract the same element from each of these dictionaries, i.e.,

[i['a'] for i in d]

What is the most efficient way to do this in Python? List comprehensions and for-loops work, but they are not known for being fast. Can the process be vectorized somehow?

Additional details: The dictionaries have multiple keys, but it is the same one that I need to extract. All the dictionaries have the same keys.

Solution

Use pandas. You pay the upfront costs of importing pandas and creating a data frame, but the subsequent operations are vectorized and efficient:

import pandas as pd

d = [{'a':1, 'b':11}, {'a':2, 'b':12}, {'a':3, 'b':13}]
df = pd.DataFrame(d)
print(df['a'])
print(list(df['a']))

Prints:

0    1
1    2
2    3
Name: a, dtype: int64

[1, 2, 3]
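
To illustrate what "vectorized" means here, the following is a minimal sketch that reuses the same small example. The variable names `values`, `doubled`, and `total` are made up for illustration; `Series.tolist()` is used as an alternative to `list(...)` for getting a plain Python list.

```python
import pandas as pd

d = [{'a': 1, 'b': 11}, {'a': 2, 'b': 12}, {'a': 3, 'b': 13}]
df = pd.DataFrame(d)

# Series.tolist() converts the column to a plain Python list.
values = df['a'].tolist()

# Arithmetic on a column is applied element-wise, without an explicit Python loop.
doubled = (df['a'] * 2).tolist()

# Aggregations operate on the whole column at once.
total = int(df['a'].sum())

print(values)
print(doubled)
print(total)
```

If the extracted values are only an intermediate step, it may be worth keeping them as a column and working on it directly, rather than converting back to a list.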

Benchmarking:

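The result is that the pandas data frame is slightly faster than the list of dictionaries for medium-size datasets, not including the cost of creating the data frame. In the run below, 100 extractions over 1 million rows took about 4.20 s with the data frame versus about 4.46 s with the list comprehension.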

The benchmarking code is based on the answer by JL Peyret. Note that, unlike in that answer, I place the data frame initialization (not just import pandas) outside the benchmarking loop, so I benchmark only access to the elements of the data structure. The data structure also has more rows (1 million). I assume this is a realistic alternative scenario for medium-size datasets.

import random
import pandas as pd
import timeit

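def pandas_dict(datain):
    # datain is deliberately ignored: the global df is built once, outside the timed call
    return list(df["a"])

def list_comp(datain):
    # baseline: plain list comprehension over the list of dictionaries
    return [v["a"] for v in datain]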

nrows = 1000000

rand = list(range(nrows))
random.shuffle(rand)

data = [{"a":rand[i], "b":rand[i]} for i in range(nrows)]

df = pd.DataFrame(data)
print(df)

results = []

for totest in [list_comp, pandas_dict]:
    print(totest.__name__)
    print(timeit.timeit(stmt='totest(data)', number=100, globals=globals()))
    res = totest(data)
    results.append(res)
    print(f"{res[0:10]=}, {len(res)=}")

Results:

             a       b
0       847669  847669
1       777701  777701
2       446229  446229
3       984577  984577
4       813383  813383
...        ...     ...
999995  636811  636811
999996  413271  413271
999997  346275  346275
999998  414864  414864
999999  381832  381832

[1000000 rows x 2 columns]
list_comp
4.460098167066462
res[0:10]=[847669, 777701, 446229, 984577, 813383, 705699, 7830, 466819, 485673, 400344], len(res)=1000000
pandas_dict
4.201689457986504
res[0:10]=[847669, 777701, 446229, 984577, 813383, 705699, 7830, 466819, 485673, 400344], len(res)=1000000
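
The timings above exclude the cost of building the data frame. A rough way to see how much that upfront cost matters is to time the construction separately, as in the sketch below. The row count, the number of repetitions, and the helper name `build_and_extract` are assumptions chosen to keep the run short; they are not part of the benchmark above, and the numbers will vary by machine.

```python
import timeit

import pandas as pd

nrows = 100_000  # assumed size, smaller than the 1 million rows used above
data = [{"a": i, "b": i} for i in range(nrows)]

def list_comp(datain):
    # baseline: plain list comprehension
    return [v["a"] for v in datain]

def build_and_extract(datain):
    # hypothetical helper: pays the data frame construction cost on every call
    return list(pd.DataFrame(datain)["a"])

repeats = 3
build_seconds = timeit.timeit(lambda: pd.DataFrame(data), number=repeats)
comp_seconds = timeit.timeit(lambda: list_comp(data), number=repeats)
both_seconds = timeit.timeit(lambda: build_and_extract(data), number=repeats)

print(f"build data frame only: {build_seconds:.3f} s for {repeats} runs")
print(f"list comprehension:    {comp_seconds:.3f} s for {repeats} runs")
print(f"build + extract:       {both_seconds:.3f} s for {repeats} runs")
```

If the column is extracted only once, the construction time has to be added to the pandas figure; the data frame is more likely to pay off when it is built once and queried many times.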