Tags: python, memory-management, memory-leaks

Issue with Python memory management


I am having trouble explaining a memory footprint pattern that I see in some code I wrote.
The code looks roughly like this:

import datetime as dt
import gc
import psutil
from my_package import some_function

process = psutil.Process()
start_date = dt.datetime(2022, 5, 1, 0)
end_date = dt.datetime(2022, 7, 31, 18)
while start_date <= end_date:
    # Process the eight 3-hourly timestamps (00, 03, ..., 21) of the current day.
    for hr in range(0, 22, 3):
        a = some_function(
            dt.datetime(start_date.year, start_date.month, start_date.day, hr)
        )
    # Sample the memory info once so that RSS and VMS come from the same reading.
    mem = process.memory_info()
    print(
        f"\n"
        f"Date processed: {start_date:%Y-%m-%d}\n"
        f"Memory consumption:\n"
        f"Resident memory (MB): {mem.rss / 1024**2}\n"
        f"Virtual memory (MB): {mem.vms / 1024**2}\n"
        f"Object count: {len(gc.get_objects())}"
    )
    gc.collect()
    start_date += dt.timedelta(days=1)

some_function initializes some objects, queries the file system for a large number of filenames, and returns a set of custom objects built from those filenames. Simplified, it looks like this:

import datetime as dt
from pathlib import Path

class SomeClass:
    def __init__(self, *args):
        self.args = args

    @staticmethod
    def from_str(some_str):
        # do some stuff with some_str
        return SomeClass(arg1_derived_from_str, arg2_derived_from_str)

def some_function(timestamp: dt.datetime) -> set[SomeClass]:
    # _derive_path is not shown here; it returns the Path that is globbed below.
    timestamp_path = _derive_path(timestamp)
    filename_strings = {f.name for f in timestamp_path.glob("*")}
    return {SomeClass.from_str(fname) for fname in filename_strings}

When I run this code, the object count stays constant across all iterations except the first two or so. The virtual memory, however, increases by 2 MB after about half of the iterations are done. Since I have had trouble with memory leaks in my_package before, I am inspecting this very thoroughly.
What is striking is that when I replace the call to some_function with a plain filesystem call such as pathlib.Path(f"/some/path/{start_date:%y/%m/%d}/{hr}").glob("*"), the memory footprint stays constant, as I would expect given the unchanging object count. This leaves me wondering: does the call to some_function introduce a memory leak that will pile up as I increase the number of iterations, or is the small increase I am observing completely normal?
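
One way to narrow this down is to check whether memory allocated through Python's own allocators grows across iterations, independently of what psutil reports for the process. The following is only a minimal sketch using the standard-library tracemalloc module. The workload function is a hypothetical stand-in for some_function, and the iteration counts are arbitrary. Note that tracemalloc does not see RSS or VMS, and that taking the first snapshot allocates some memory itself, so a small nonzero difference is expected.

```python
import gc
import tracemalloc


def workload(i):
    # Hypothetical stand-in for some_function: builds and returns a small set.
    return {str(i + k) for k in range(100)}


def python_heap_growth(func, iterations=200, warmup=20):
    """Return (net size difference in bytes, five largest per-line differences)."""
    tracemalloc.start()
    try:
        # Warm up first so that one-time allocations do not count as growth.
        for i in range(warmup):
            func(i)
        gc.collect()
        before = tracemalloc.take_snapshot()

        for i in range(warmup, warmup + iterations):
            func(i)
        gc.collect()
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()

    # Group allocations by source line and compare the two snapshots.
    stats = after.compare_to(before, "lineno")
    return sum(stat.size_diff for stat in stats), stats[:5]


growth, top = python_heap_growth(workload)
print(f"Net change in traced allocations: {growth} bytes")
for stat in top:
    print(stat)
```

If the traced allocations stay flat while the virtual memory still grows, the growth is probably happening below the level of Python objects, which would be consistent with a constant object count.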


Solution

  • I apologize for the rather convoluted question and the incomplete code snippet. My limited understanding and the complexity of the code in SomeClass made it hard for me to get to the point. I have now produced a small working example that illustrates what went wrong in my code. My implementation of SomeClass defined a dataclass nested inside one of its static methods, as shown below:

    from dataclasses import dataclass
    import gc
    import psutil
    
    class SomeClass:
        def __init__(self, *args):
            self.args = args
    
        @staticmethod
        def from_str(some_str):
            return SomeClass(*SomeClass.disassemble(some_str).to_tuple())
    
        @staticmethod
        def disassemble(some_str):
            @dataclass
            class StringAttributes:
                attr1: str
    
                def to_tuple(self):
                    return (v for v in self.__dict__.values())
            return StringAttributes(some_str[0])
    
    
    def some_function(strings: set[str]) -> set[SomeClass]:
        return {SomeClass.from_str(s) for s in strings}
    

    When I now call some_function in a loop (the snippet below lives in the same module as the snippet above) and monitor the memory usage with psutil, I observe that the virtual memory increases slightly at irregular intervals (on my machine, once when counter reaches 1550 and again when counter reaches 27650). The number of objects tracked by the garbage collector, however, stays constant from the second iteration onwards.

    counter = 0
    process = psutil.Process()
    while True:
        a = some_function({str(counter)})
        print(
            f"\n"
            f"Iteration: {counter}\n"
            f"Memory consumption:\n"
            f"Resident memory (MB): {process.memory_info().rss / 1024 ** 2}\n"
            f"Virtual memory (MB): {process.memory_info().vms / 1024 ** 2}\n"
            f"Object count: {len(gc.get_objects())}"
        )
        gc.collect()
        counter += 1
    

    When I move the dataclass definition out of disassemble to module level, the virtual memory stays constant throughout the whole loop, as far as I have tested.
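
The difference between the two variants can be made visible without measuring memory at all. A class statement inside a function body is executed on every call, so the nested version builds a brand-new dataclass each time, while the module-level version builds it once. The sketch below only illustrates this. The names nested_factory, module_level_factory and ModuleLevelAttributes are made up for the example and are not part of my original code.

```python
from dataclasses import dataclass


def nested_factory(value):
    # The class statement (and the dataclass decorator) runs on every call,
    # so each call creates a new, distinct class object.
    @dataclass
    class StringAttributes:
        attr1: str

    return StringAttributes(value)


@dataclass
class ModuleLevelAttributes:
    attr1: str


def module_level_factory(value):
    # The class was created once at import time and is reused here.
    return ModuleLevelAttributes(value)


a, b = nested_factory("x"), nested_factory("x")
c, d = module_level_factory("x"), module_level_factory("x")

print(type(a) is type(b))  # False: two different classes with the same name
print(a == b)              # False: dataclass equality requires the same class
print(type(c) is type(d))  # True
print(c == d)              # True
```

So in the nested variant every iteration of the loop creates and discards a whole class, not just one small instance.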

    from dataclasses import dataclass
    import gc
    import psutil
    
    @dataclass
    class StringAttributes:
        attr1: str
    
        def to_tuple(self):
            return (v for v in self.__dict__.values())
    
    
    class SomeClass:
        def __init__(self, *args):
            self.args = args
    
        @staticmethod
        def from_str(some_str):
            return SomeClass(*SomeClass.disassemble(some_str).to_tuple())
    
        @staticmethod
        def disassemble(some_str):
            return StringAttributes(some_str[0])
    
    
    def some_function(strings: set[str]) -> set[SomeClass]:
        return {SomeClass.from_str(s) for s in strings}
    

    Moving the dataclass definition out of disassemble thus solves my problem, but two questions remain:

    • Why does this solve my problem, i.e. what exactly stays in the process's memory when I define the dataclass inside the method? My guess is that the problem comes from returning an instance of the dataclass from disassemble, which moves it outside the scope where the class is defined.
    • Why is the increase in virtual memory not reflected in the number of objects tracked by the garbage collector or in the list of uncollectable objects (gc.garbage)?
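
A partial observation that bears on both questions, as a sketch for CPython only: a class object is part of reference cycles (for example, its __mro__ tuple refers back to the class), so a class created inside a function is not freed when the last name bound to it goes away. It is only reclaimed by the cyclic garbage collector. The helper make_class below is hypothetical and uses a plain class to keep the example small.

```python
import gc
import weakref


def make_class():
    # A new class object is created on each call.
    class Local:
        pass

    return Local


gc.disable()  # keep automatic collections from interfering with the check
try:
    cls = make_class()
    ref = weakref.ref(cls)  # observe the class without keeping it alive
    del cls
    alive_after_del = ref() is not None

    gc.collect()
    alive_after_collect = ref() is not None
finally:
    gc.enable()

print(alive_after_del)      # True: reference counting alone does not free it
print(alive_after_collect)  # False: the cyclic collector reclaimed it
```

Since the loop calls gc.collect() on every iteration, these throwaway classes are reclaimed each time, which would fit the constant length of gc.get_objects(). That count only says how many objects the collector is tracking at that moment. It says nothing about how much memory the allocator has obtained from the operating system, and freed memory is not necessarily handed back. I have not verified that this is what causes the increase in my case.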

    I posted the refined question here: "Nested dataclass introducing memory leak, but gc.get_objects() has constant length".