If I have a list:
list_1 = ["apples", "apricots", "oranges"]
and I append an new item to the list : "berries"
list_1 = ["apples", "apricots", "oranges", "berries"]
Under-the-hood (so to speak), I thought I remember reading that Python creates another list (list_2) and points it to the original list (list_1) so that list_1 remains static...if this is true, would it look something like this (under-the-hood)?
list_1 = ["apples", "apricots", ["oranges", "berries"]]
So in this way, the original list maintains its size. Is this the correct way of looking at it?
No, Python does not create another list when you call append
. It mutates the existing list in-place. You can see this pretty easily:
>>> lst1 = []
>>> lst2 = lst1
>>> lst1.append(0)
>>> lst1
[0]
>>> lst2
[0]
If you want to create another list, you can do this instead:
>>> lst1 = []
>>> lst2 = lst1
>>> lst1 = lst1 + [0]
>>> lst1
[0]
>>> lst2
[]
So, how does that in-place appending work? Aren't lists just arrays under the hood? Yes, they are. Python leaves a little space at the end, but if you append
enough times, it has to allocate a new array for the list, move over all the elements, and delete the old one. It's still the same list object, but with a different array under the hood.
That growing doesn't just add one new slot each time—that would mean each append
has to reallocate the whole list, so appending would take average linear time. Instead, it multiplies the length. Something like this:
new_capacity = max(4, capacity * 8 // 5, new_length)
(The new_length
is there in case you're extend
ing the list with a whole bunch of elements at once.)
By expanding geometrically rather than arithmetically, we can guarantee that, while a few append
s do take linear time, enough of them are instant that the amortized time is constant. Exactly what factor you use is a tradeoff between speed (high numbers mean fewer reallocations) and space (higher numbers mean more wasted space on the end). I don't know what CPython does, but you can find it in the source code linked below. Most systems use a value between 1.5 and 2.0 (and usually a nice fraction of small numbers so they can do integer multiple and divide).
If you really want to understand this, and you can follow basic C, you can look under the hood at listobject.h
and listobject.c
. You'll probably want to read the C API docs first, but here's the basics (in Python-like pseudocode, and intentionally using not quite the real function and field names):
if lst.size + 1 > lst.allocated:
new_capacity = <see above>
lst.array = PyRealloc(<enough memory for new_capacity pointers>)
lst.allocated = new_capacity
incref(new_item)
lst.array[lst.size] = new_item
lst.size += 1
The Realloc
function is going to be a thin wrapper around the platform's function, which will try to find more room in-place, but fall back to allocating a totally new pointer and moving over all of the contents.
Since you're using Python, there's a good chance you're the kind of person who likes to learn through interactive experimentation. If you don't know about ctypes.pythonapi
. you should definitely start playing with it. You can call almost anything from the C API from inside Python. Unfortunately, you can't call #define
macros, or dig into the structs without a bit of extra work—but see superhackyinternals
for how you can do that bit of extra work. (I don't think I included anything there for lists, but look at how ints work, and you should be able to get it from there—just don't look at strings, because they're a lot more complicated.) Of course playing around with this stuff from inside your interpreter, you're going to segfault a lot, so don't do it in a session where you've got any important history.
And of course that isn't guaranteed to be true for every Python implementation. As long as an implementation can provide the documented interface and performance characteristics, it can build lists however it wants. For example, maybe IronPython uses some vector class in the .NET class library. Of course that class will do similar reallocate-and-move under its own hood, but IronPython won't care how it does that (and you'll care even less).