I have a (long) list of dictionaries, but for the sake of this example I represent them as
d = [{'a':1}, {'a':2}, {'a':3}]
I need to extract the same element from each of these dictionaries, i.e.,
[i['a'] for i in d]
What is the most efficient way to do this in Python? List comprehensions and for-loops work well, but are not known to be very efficient. Can the process be vectorized somehow?
Additional details: The dictionaries have multiple keys, but it is the same one that I need to extract. All the dictionaries have the same keys.
Use pandas. You have to pay the upfront costs of importing it and creating a data frame, but the subsequent operations are vectorized and efficient:
import pandas as pd
d = [{'a':1, 'b':11}, {'a':2, 'b':12}, {'a':3, 'b':13}]
df = pd.DataFrame(d)
print(df['a'])
print(list(df['a']))
Prints:
0    1
1    2
2    3
Name: a, dtype: int64
[1, 2, 3]
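As a minimal sketch of what "vectorized" can mean here, assuming the extracted values are numeric and will be processed further, the column can be kept as a NumPy array instead of being converted back to a Python list. The names `a`, `doubled`, and `total` are illustrative only and are not part of the original example:

```python
import pandas as pd

d = [{'a': 1, 'b': 11}, {'a': 2, 'b': 12}, {'a': 3, 'b': 13}]
df = pd.DataFrame(d)

# Extract the column as a NumPy array rather than a Python list.
a = df['a'].to_numpy()

# Operations on the array run without an explicit Python-level loop.
doubled = a * 2
total = int(a.sum())

print(doubled.tolist())  # [2, 4, 6]
print(total)             # 6
```

Whether this pays off depends on what happens to the values afterwards. If a plain Python list is needed in the end, the conversion back to a list adds its own cost.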
Benchmarking:
The result is that a pandas data frame is slightly faster than the list of dictionaries for medium-size datasets, not including the cost of creating the data frame.
The benchmarking code is based on the answer by JL Peyret.
Note that, unlike in that answer, I place the data frame initialization (not just import pandas) outside the benchmarking loop. I benchmark only access to the elements of the data structure. The data structure also has more rows (1 million). I assume this is a realistic alternative scenario for medium-size datasets.
import random
import pandas as pd
import timeit
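def pandas_dict(datain):
    # datain is unused: the column is read from the global data frame df,
    # which is built once, outside the timed code.
    return list(df["a"])
def list_comp(datain):
    return [v["a"] for v in datain]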
nrows = 1000000
rand = list(range(nrows))
random.shuffle(rand)
data = [{"a":rand[i], "b":rand[i]} for i in range(nrows)]
df = pd.DataFrame(data)
print(df)
results = []
for totest in [list_comp, pandas_dict]:
    print(totest.__name__)
    print(timeit.timeit(stmt='totest(data)', number=100, globals=globals()))
    res = totest(data)
    results.append(res)
    print(f"{res[0:10]=}, {len(res)=}")
Results:
             a       b
0       847669  847669
1       777701  777701
2       446229  446229
3       984577  984577
4       813383  813383
...        ...     ...
999995  636811  636811
999996  413271  413271
999997  346275  346275
999998  414864  414864
999999  381832  381832
[1000000 rows x 2 columns]
list_comp
4.460098167066462
res[0:10]=[847669, 777701, 446229, 984577, 813383, 705699, 7830, 466819, 485673, 400344], len(res)=1000000
pandas_dict
4.201689457986504
res[0:10]=[847669, 777701, 446229, 984577, 813383, 705699, 7830, 466819, 485673, 400344], len(res)=1000000
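The timings above exclude the cost of building the data frame. A rough sketch of how that upfront cost could be measured next to the per-access cost is shown below. It assumes a much smaller dataset (a hypothetical `nrows = 10_000`) so that it runs quickly, and the names `build_time`, `pandas_time`, and `list_time` are illustrative only. No timings are quoted because they depend on the machine and the pandas version:

```python
import timeit

import pandas as pd

nrows = 10_000  # hypothetical size, chosen only to keep the sketch fast
data = [{"a": i, "b": i} for i in range(nrows)]

# One-time cost: building the data frame from the list of dictionaries.
build_time = timeit.timeit(lambda: pd.DataFrame(data), number=5) / 5

df = pd.DataFrame(data)

# Per-access cost: extracting the same key with each approach.
pandas_time = timeit.timeit(lambda: list(df["a"]), number=5) / 5
list_time = timeit.timeit(lambda: [v["a"] for v in data], number=5) / 5

print(f"{build_time=:.6f} {pandas_time=:.6f} {list_time=:.6f}")
```

If the extraction happens only once, the build cost has to be added to the pandas figure before comparing it with the list comprehension.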
See also: