python, python-3.x, garbage-collection

Why does Python's "gc.collect()" not work as expected?


Here is my test code:

#! /usr/bin/python3
import gc
import ctypes

name = "a" * 50
name_id = id(name)
del name
gc.collect()
print(ctypes.cast(name_id, ctypes.py_object).value)

output:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

In my opinion, gc.collect() should clean up the variable name and its value,
so why can I still get the value through name_id after gc.collect()?


Solution

  • You shouldn't expect gc.collect() to do anything here. The gc module only controls the cyclic garbage collector, which is an auxiliary collector: CPython uses reference counting as its main memory management strategy. The cyclic garbage collector handles reference cycles, and there are no reference cycles here, so gc.collect() won't do anything.
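To make the split between reference counting and the cyclic collector concrete, here is a minimal sketch. It assumes CPython; the `Node` class and the `refcount_vs_cycle` function are hypothetical names introduced only for this illustration. Weak references are used to observe whether an object is still alive without keeping it alive.

```python
import gc
import weakref


class Node:
    """Hypothetical helper: instances can point at each other."""

    def __init__(self):
        self.other = None


def refcount_vs_cycle():
    was_enabled = gc.isenabled()
    gc.disable()  # stop automatic cycle collection from interfering
    try:
        # No cycle: the object dies as soon as its last reference goes away.
        plain = Node()
        plain_ref = weakref.ref(plain)
        del plain
        freed_by_refcount = plain_ref() is None

        # A two-object cycle: reference counts never reach zero on their own.
        a, b = Node(), Node()
        a.other, b.other = b, a
        cycle_ref = weakref.ref(a)
        del a, b
        alive_before_collect = cycle_ref() is not None

        # Only now does gc.collect() have something to do.
        found = gc.collect()
        freed_after_collect = cycle_ref() is None
    finally:
        if was_enabled:
            gc.enable()
    return freed_by_refcount, alive_before_collect, found, freed_after_collect


print(refcount_vs_cycle())
```

On CPython, the plain object should be gone right after `del`, while the cycle should survive until `gc.collect()` runs. Other Python implementations may behave differently.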

    In my opinion, gc.collect() should clean the variable name and its value,

    That is simply not how Python works. The variable ceased to exist with del name, but the object continues to exist, in this case because of a compiler optimization. Python variables are not like C variables: they aren't chunks of memory, they are names that refer to objects in a particular namespace.
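The name/object distinction can be sketched with two names bound to one object. The function name `names_vs_objects` and the variable names are illustrative only, and `sys.getrefcount` is a CPython-specific tool whose absolute values vary between versions, so only the difference is compared here.

```python
import sys


def names_vs_objects():
    data = ["x"] * 3
    alias = data                      # a second name for the same object
    same_object = alias is data

    before = sys.getrefcount(alias)
    del data                          # removes the name, not the object
    after = sys.getrefcount(alias)

    # The list is still reachable through the remaining name.
    return same_object, before - after, alias


print(names_vs_objects())
```

Deleting `data` drops one reference, but the list itself stays alive because `alias` still refers to it.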

    In any case, disassembling the code will give you some insight here:

    In [1]: import dis
    
    In [2]: dis.dis("""
       ...: import gc
       ...: import ctypes
       ...:
       ...: name = "a" * 50
       ...: name_id = id(name)
       ...: del name
       ...: gc.collect()
       ...: print(ctypes.cast(name_id, ctypes.py_object).value)
       ...: """)
      2           0 LOAD_CONST               0 (0)
                  2 LOAD_CONST               1 (None)
                  4 IMPORT_NAME              0 (gc)
                  6 STORE_NAME               0 (gc)
    
      3           8 LOAD_CONST               0 (0)
                 10 LOAD_CONST               1 (None)
                 12 IMPORT_NAME              1 (ctypes)
                 14 STORE_NAME               1 (ctypes)
    
      5          16 LOAD_CONST               2 ('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa')
                 18 STORE_NAME               2 (name)
    
      6          20 LOAD_NAME                3 (id)
                 22 LOAD_NAME                2 (name)
                 24 CALL_FUNCTION            1
                 26 STORE_NAME               4 (name_id)
    
      7          28 DELETE_NAME              2 (name)
    
      8          30 LOAD_NAME                0 (gc)
                 32 LOAD_METHOD              5 (collect)
                 34 CALL_METHOD              0
                 36 POP_TOP
    
      9          38 LOAD_NAME                6 (print)
                 40 LOAD_NAME                1 (ctypes)
                 42 LOAD_METHOD              7 (cast)
                 44 LOAD_NAME                4 (name_id)
                 46 LOAD_NAME                1 (ctypes)
                 48 LOAD_ATTR                8 (py_object)
                 50 CALL_METHOD              2
                 52 LOAD_ATTR                9 (value)
                 54 CALL_FUNCTION            1
                 56 POP_TOP
                 58 LOAD_CONST               1 (None)
                 60 RETURN_VALUE
    

    So, when your code block was compiled, the CPython compiler noticed that "a" * 50 could be turned into a constant, and so it did. A code object stores its constants for as long as the code object itself exists (in this case, until the interpreter exits). Since this code object maintains a reference to the string object, the string exists the entire time.

    So, more explicitly:

    In [4]: code = compile("""name = "a" * 50""", filename='foo', mode='exec')
    
    In [5]: code
    Out[5]: <code object <module> at 0x7ff7c12495d0, file "foo", line 1>
    
    In [6]: code.co_consts
    Out[6]: ('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', None)
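For contrast, a small sketch of what happens when the multiplication cannot be folded at compile time. This relies on CPython's constant folding, which is an implementation detail; the `<demo>` filename and the variable names are arbitrary.

```python
# Both operands are literals, so CPython can fold the product into a constant.
folded = compile('name = "a" * 50', "<demo>", "exec")

# Here the right operand is a variable, so the product is computed at run time.
not_folded = compile('n = 50\nname = "a" * n', "<demo>", "exec")

target = "a" * 50
print(target in folded.co_consts)      # the 50-character string is stored
print(target in not_folded.co_consts)  # only "a", 50 and None are stored
```

In the second case the code object holds no reference to the resulting string, so the string's lifetime depends only on the names bound to it.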
    

    Note also that Python memory management is complex and pretty opaque. All objects live on a privately managed heap. Just because an object is "released" doesn't mean that the runtime won't simply reuse that bit of memory for objects of the same type (or other suitable types) as needed. Look at this:

    In [1]: class Foo: pass
    
    In [2]: import ctypes
    
    In [3]: foo = Foo()
    
    In [4]: id(foo)
    Out[4]: 140559250737552
    
    In [5]: del foo
    
    In [6]: foo2 = Foo()
    
    In [7]: id(foo2)
    Out[7]: 140559250737680
    
    In [8]: ctypes.cast(140559250737552, ctypes.py_object).value
    Out[8]: <prompt_toolkit.lexers.pygments.RegexSync at 0x7fd68035c990>
    
    In [9]: id(foo2)
    Out[9]: 140559250737680
    
    In [10]: del foo2
    
    In [11]: ctypes.cast(140559250737680, ctypes.py_object).value
    Out[11]: <prompt_toolkit.lexers.pygments.PygmentsLexer at 0x7fd68035ca10>
    

    Notice how you are able to recover some objects in these cases: the IPython interactive shell is creating objects all the time, and the internal heap is happy to reuse that memory.
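The reuse of freed memory can also be seen without ctypes, by checking how often a new object gets the same id() as one that was just deleted. This is only a sketch: `count_reused_ids` is a hypothetical helper, and the result depends on the interpreter and allocator, so no particular count is guaranteed.

```python
class Foo:
    pass


def count_reused_ids(trials=1000):
    """Count how often a fresh Foo lands at the address of a just-deleted one."""
    reused = 0
    for _ in range(trials):
        first = Foo()
        first_id = id(first)
        del first            # the object is released here in CPython
        second = Foo()       # the allocator may hand back the same slot
        if id(second) == first_id:
            reused += 1
    return reused


print(count_reused_ids())
```

In a plain script nothing else allocates between the `del` and the next `Foo()`, so reuse tends to be frequent. In the IPython session above, the shell's own objects took those slots instead.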

    Look what happens in a more bare-bones REPL:

    (base) juanarrivillaga@50-254-139-253-static% python
    Python 3.7.9 (default, Aug 31 2020, 07:22:35)
    [Clang 10.0.0 ] :: Anaconda, Inc. on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import ctypes
    >>> class Foo: pass
    ...
    >>> foo = Foo()
    >>> i = id(foo)
    >>> del foo
    >>> ctypes.cast(i, ctypes.py_object).value
    zsh: segmentation fault  python
    

    This is closer to what one might expect: we tried to access a part of memory that had been not only reclaimed by the internal heap but also freed by the Python process, and so we got a segmentation fault.
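If the goal is just to observe when an object is reclaimed, a safer alternative to dereferencing a raw address is `weakref.finalize`, which runs a callback once the object is gone. A minimal sketch, assuming CPython's immediate reclamation on `del`; the `observe_reclaim` function and the message string are made up for the example.

```python
import weakref


class Foo:
    pass


def observe_reclaim():
    events = []
    foo = Foo()
    # Register a callback that fires when foo is reclaimed.
    finalizer = weakref.finalize(foo, events.append, "foo reclaimed")

    alive_before = finalizer.alive
    del foo                  # last reference gone; CPython frees it right away
    alive_after = finalizer.alive
    return alive_before, alive_after, events


print(observe_reclaim())
```

Unlike `ctypes.cast` on a stale id, this never touches memory that may already have been freed, so it cannot crash the interpreter. Note that it works only for objects that support weak references, which excludes `str`.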