I am having some trouble explaining a memory footprint phenomenon I am seeing in some code that I wrote.
The code basically looks like this:
import datetime as dt
import psutil
import gc
from my_package import some_function

process = psutil.Process()
start_date = dt.datetime(2022, 5, 1, 0)
end_date = dt.datetime(2022, 7, 31, 18)

while start_date <= end_date:
    for hr in range(0, 22, 3):
        a = some_function(dt.datetime(start_date.year, start_date.month, start_date.day, hr))
    print(
        f"\n"
        f"Date processed: {start_date:%Y-%m-%d}\n"
        f"Memory consumption:\n"
        f"Resident memory (MB): {process.memory_info().rss / 1024**2}\n"
        f"Virtual memory (MB): {process.memory_info().vms / 1024**2}\n"
        f"Object count: {len(gc.get_objects())}"
    )
    gc.collect()
    start_date += dt.timedelta(days=1)
some_function initializes some objects, queries the file system to extract a large number of filenames, and returns a set of custom objects built from the retrieved filenames. Simplified, it looks like this:
import datetime as dt
from pathlib import Path


class SomeClass:
    def __init__(self, *args):
        self.args = args

    @staticmethod
    def from_str(some_str):
        # do some stuff with some_str
        return SomeClass(arg1_derived_from_str, arg2_derived_from_str)


def some_function(timestamp: dt.datetime) -> set[SomeClass]:
    timestamp_path = _derive_path(timestamp)
    filename_strings = {f.name for f in timestamp_path.glob("*")}
    return {SomeClass.from_str(fname) for fname in filename_strings}
When I run this code, the object count remains constant through all iterations except the first two or so. The virtual memory, however, increases by 2 MB after about half of the iterations are done. Since I have had trouble with memory leaks in my_package before, I am inspecting this very thoroughly.
What is striking is that when I replace the call to some_function with a plain filesystem call like pathlib.Path(f"/some/path/{start_date:%y/%m/%d}/{hr}").glob("*"), the memory footprint stays constant, as I would expect given the unchanging object count.
So this leaves me wondering: does the call to some_function introduce a memory leak that will pile up if I increase the number of iterations, or is the small increase I am observing completely normal?
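In case it helps, the way I am planning to dig deeper is roughly this (only a sketch using the standard tracemalloc module; the iteration count and the reuse of dt and some_function from the snippet above are just for illustration):

import tracemalloc

tracemalloc.start()
baseline = None
for i in range(100):
    a = some_function(dt.datetime(2022, 5, 1, (i % 8) * 3))
    snapshot = tracemalloc.take_snapshot()
    if baseline is None:
        baseline = snapshot
    else:
        # top 5 sources of allocations that appeared since the first iteration
        for stat in snapshot.compare_to(baseline, "lineno")[:5]:
            print(stat)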
So... I apologize for the rather convoluted formulation of the question and the incomplete code snippet. It was my lack of understanding and the complexity of the code in SomeClass that made it hard for me to get to the point. I have now produced a rather small working example that illustrates what went wrong in my code. My implementation of SomeClass had a nested dataclass defined like below:
from dataclasses import dataclass
import gc
import psutil


class SomeClass:
    def __init__(self, *args):
        self.args = args

    @staticmethod
    def from_str(some_str):
        return SomeClass(*SomeClass.disassemble(some_str).to_tuple())

    @staticmethod
    def disassemble(some_str):
        @dataclass
        class StringAttributes:
            attr1: str

            def to_tuple(self):
                return (v for v in self.__dict__.values())

        return StringAttributes(some_str[0])


def some_function(strings: set[str]) -> set[SomeClass]:
    return {SomeClass.from_str(s) for s in strings}
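To see what defining the dataclass inside disassemble actually does, a quick check (my own addition on top of the snippet above) shows that every call executes the class statement and the @dataclass decorator again, producing a brand-new class object each time:

# each call to disassemble defines StringAttributes from scratch,
# so the resulting class objects are distinct even though they look identical
cls_a = type(SomeClass.disassemble("a"))
cls_b = type(SomeClass.disassemble("b"))
print(cls_a is cls_b)      # False
print(cls_a.__qualname__)  # SomeClass.disassemble.<locals>.StringAttributes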
When I now call some_function in a loop (the code snippet below lives in the same module as the snippet above) and monitor the memory usage with psutil, I observe that the virtual memory increases ever so slightly at irregular intervals (on my machine it happens once when counter reaches 1550 and again when counter reaches 27650). The count of objects tracked by the garbage collector, however, stays constant from the second iteration onwards.
counter = 0
process = psutil.Process()

while True:
    a = some_function({str(counter)})
    print(
        f"\n"
        f"Date processed: {counter}\n"
        f"Memory consumption:\n"
        f"Resident memory (MB): {process.memory_info().rss / 1024 ** 2}\n"
        f"Virtual memory (MB): {process.memory_info().vms / 1024 ** 2}\n"
        f"Object count: {len(gc.get_objects())}"
    )
    gc.collect()
    counter += 1
When I move the definition of the dataclass outside of the definition of disassemble, the virtual memory signature remains constant throughout the whole loop, as far as I have tested it.
from dataclasses import dataclass
import gc
import psutil


@dataclass
class StringAttributes:
    attr1: str

    def to_tuple(self):
        return (v for v in self.__dict__.values())


class SomeClass:
    def __init__(self, *args):
        self.args = args

    @staticmethod
    def from_str(some_str):
        return SomeClass(*SomeClass.disassemble(some_str).to_tuple())

    @staticmethod
    def disassemble(some_str):
        return StringAttributes(some_str[0])


def some_function(strings: set[str]) -> set[SomeClass]:
    return {SomeClass.from_str(s) for s in strings}
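Running the same kind of check against this version (again just my own sanity check, not part of the original code) confirms that StringAttributes is now created exactly once and shared across calls:

cls_a = type(SomeClass.disassemble("a"))
cls_b = type(SomeClass.disassemble("b"))
print(cls_a is cls_b)             # True
print(cls_a is StringAttributes)  # True: a single module-level class object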
Moving the dataclass definition outside of the disassemble definition thus solves my problem, but two questions still remain for me:

1. Why do I have to define the dataclass outside of the disassemble method and thus move it outside of the scope where it is defined?
2. Why is this memory increase not visible to the garbage collector (e.g. in gc.garbage)? (A small check I tried is sketched just below.)
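For the second question, the only thing I have tried so far is holding a weak reference to one of the per-call classes from the nested-dataclass variant above and forcing a collection (purely exploratory, standard library only):

import gc
import weakref

# weak reference to the class object created by one call to the nested-dataclass variant
cls_ref = weakref.ref(type(SomeClass.disassemble("x")))
gc.collect()
# if this prints None, the throwaway class object itself has been collected,
# so whatever keeps growing is apparently not the class objects themselves
print(cls_ref())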
I put the refined question here: Nested dataclass introducing memory leak, but gc.get_objects() has constant length