I am working with quite big number of values in Python (memory footprint is 5GB).
Sometimes, I need to access values by key, sometimes I need to loop values. For performance reasons, I am converting the Dict to a List at startup, so I can:
my_big_dict_of_values
my_big_values_list = list(my_big_dict_of_values.values())
Here's a performance comparison, just for clarity:
>some_dict = dict(zip(range(1000000), reversed(range(1000000))))
>some_list = list(some_dict.values())
>%timeit for t in some_dict.values(): t
21.1 ms ± 483 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>%timeit for t in some_list: t
16.1 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
My problem arises when I need to delete keys from the dict based on user input. First, I delete entries from the dict with:
for key in keys_to_remove:
del(my_big_dict_of_values[key])
After this operation, I also want to update my_big_values_list
. I can do this with:
Solution A (Slow)
indexes_to_remove = list()
for idx, value in enumerate(my_big_values_list):
if value.key in keys_to_remove:
indexes_to_remove.append(idx)
for index in sorted(indexes_to_remove, reverse=True):
del my_big_values_list[index]
However, this is really slow and cumbersome.
Ideally, I would like to just create the list from the dict again with:
Solution B (Fast with reference issue)
my_big_values_list = list(my_big_dict_of_values.values())
This is fast, but it seems to create a new reference. I would need to replace all references of my_big_values_list
passed to other classes/functions, which appears strange, e.g. to illustrate.
my_big_dict_of_values
my_big_values_list = list(
my_big_dict_of_values.values())
handle_process = handle_process_class(
my_big_dict_of_values, my_big_values_list)
userinput = userinput(handle_process)
handle_process.calculate()
def userinput_class():
def __init__(handle_process):
self.handle_process = handle_process
def user_del_key(key):
del(self.handle_process.my_big_dict_of_values[key])
# Update list here too:
# Solution A works
# Solution B throws error in
# handle_process.calculate() because
# handle_process still has old list
def handle_process_class():
def __init__(my_big_dict_of_values, my_big_values_list):
self.my_big_dict_of_values = my_big_dict_of_values
self.my_big_values_list = my_big_values_list
def calculate(self):
return len(self.my_big_values_list)
Is there a way to modify my_big_values_list
in place, but simply replace with a new list (e.g. list(my_big_dict_of_values.values())).
I have read how Python passes references to values, and I think I understand most of it. That's why I came up with solution A, but I don't know how to use Solution B to modify the referenced list. Perhaps someone can explain what is going on here?
To modify the list in-place, assign to its slice:
my_big_values_list[:] = list(my_big_dict_of_values.values())
Example:
>>> my_big_dict_of_values = {"a": 1, "b": 2, "c": 3}
>>> my_big_values_list = list(my_big_dict_of_values.values())
>>> another_list_reference = my_big_values_list
>>> print(my_big_values_list, another_list_reference)
[1, 2, 3] [1, 2, 3]
>>> del(my_big_dict_of_values["b"])
>>> my_big_values_list[:] = list(my_big_dict_of_values.values())
>>> print(my_big_values_list, another_list_reference)
[1, 3] [1, 3]
However in terms of performance and memory usage, you should consider if a separate huge list is really necessary, as you can loop directly over dictionary.values()
.