Search code examples
pythonmemorydictionarysetdefault

Python dict.setdefault uses more memory?


I was writing some Python code that involved something like this

values = {}
for element in iterable:
    values.setdefault(element.name, []).append(element)

Because I could have sorted the input previously, I also implemented it like this

values = {}

cur_name = None
cur_list = None

for element in iterable:
    if element.name != cur_name:
        values[cur_name] = cur_list
        cur_name = element.name
        cur_list = []
    cur_list.append(element)
if cur_list:
    values[cur_name] = cur_list
del values[None]

Here the input is already sorted by element.name.

The second approach was much faster than the first approach, and it also used less memory.

What's the reason for this?

Or have I made some sort of mistake in the second approach?


Solution

  • Your original code every time round the loop will create a list that mostly then just gets thrown away. It also makes multiple dictionary lookups (looking up the method setdefault is a dictionary lookup and then the method itself does a dictionary lookup to see whether the object was set and if it isn't does another to store the value). .name and .append() are also dictionary lookups but they are still present in the revised code.

    for element in iterable:
        values.setdefault(element.name, []).append(element)
    

    The revised code only looks up the dictionary when the name changes, so it it removes two dictionary lookups and a method call from every loop. That's why it's faster.

    As for the memory use, when the list grows it may sometimes have to copy the data but can avoid that if the memory block can just be expanded. My guess would be that creating all of those unused temporary lists may be fragmenting the memory more and forcing more copies. In other words Python isn't actually using more memory, but it may have more allocated but unused memory.

    When you feel a need for setdefault consider using collections.defaultdict instead. That avoids creating the list except when it's needed:

    from collections import defaultdict
    values = defaultdict(list)
    for element in iterable:
        values[element.name].append(element)
    

    That will probably still be slower than your second code because it doesn't take advantage of your knowledge that names are all grouped, but for the general case it is better than setdefault.

    Another way would be to use itertools.groupby. Something like this:

    from itertools import groupby
    from operator import attrgetter
    values = { name: list(elements) for name,elements in
        groupby(elements, attrgetter('name')) }
    

    That takes advantage of the ordering and simplifies everything down to a single dictionary comprehension.