Tags: python, pandas, memory

Python - How to utilise more memory


I have a pandas script that runs for over an hour before producing its output. When I first ran it, my computer had 16GB of RAM and Task Manager showed around 80% memory usage while the script was running. I have now upgraded my RAM to 64GB, but there is no improvement: the system does not use all the available memory. Memory usage is now around 30% and CPU usage is around 10%.

How can I increase memory usage to speed up the script's execution?


Solution

  • It's not clear why you expect that providing more memory would make DataFrame operations run faster. Even at 80% utilisation, memory didn't seem to be the bottleneck.

    However, Linux has swap space, and swapping could legitimately slow a computation down: the OS may decide it has to start dumping data to disk and reading it back. You can check for this with something like htop. That said, the fact that upgrading your RAM hasn't changed the processing speed suggests swapping wasn't the issue here.

    RAM is a potential limitation on processing speed, but adding more doesn't directly make processing faster. How could it? It's the CPU that does the processing.

    Your CPU utilisation is low, and that's what actually limits your speed here. A lot of operations in pandas are single-threaded, and Python's GIL is a real limitation. There are ways around this in compiled extensions (which pandas, and the libraries it's built on, sometimes use), but it's hard to know a priori whether a particular operation you're using actually does this.
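    As a rough illustration (a toy sketch, not your actual workload): the same column sum done as a Python-level loop runs one bytecode step at a time under the GIL, while the vectorised call hands the whole array to compiled code in a single step. This is typically where the big pandas speed differences come from, not from the amount of RAM.

    ```python
    import numpy as np
    import pandas as pd

    # Toy data standing in for a large DataFrame.
    df = pd.DataFrame({"x": np.arange(1_000_000)})

    # Row-by-row Python loop: every addition is interpreted bytecode
    # executed under the GIL.
    total_loop = 0
    for value in df["x"]:
        total_loop += value

    # Vectorised: pandas dispatches the whole sum to compiled NumPy
    # code in one call, so the Python interpreter barely participates.
    total_vec = df["x"].sum()

    assert total_loop == total_vec
    ```

    Timing these two (e.g. with time.perf_counter) on your own machine will show the loop version is orders of magnitude slower, even though both use the same amount of memory.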

    polars seeks to address this in a number of ways. It multi-threads a lot of its operations (getting around the GIL where pandas doesn't), and it also has "lazy evaluation", which does the opposite of using more RAM: it loads only the data you actually need for your calculation, rather than the whole lot up-front.
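    A minimal sketch of the lazy style (made-up data; on a real dataset you would start from something like pl.scan_csv so unused rows and columns are never read from disk at all):

    ```python
    import polars as pl

    # Toy data; .lazy() switches to building a query plan instead of
    # executing each step eagerly.
    lf = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}).lazy()

    # Nothing is computed yet: filter/select just describe the query.
    query = lf.filter(pl.col("a") > 1).select(pl.col("b").sum())

    # collect() optimises the whole plan and runs it, using multiple
    # threads where it can.
    result = query.collect()
    ```

    Because the engine sees the whole query before running anything, it can push the filter down and skip materialising intermediate results, which is exactly the "don't load what you don't need" behaviour described above.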

    None of this will help if your code is built around some pathological loop that prevents the libraries from applying these optimisations, but hopefully it gives you some pointers on how to proceed.