
Why does massive string allocation seem to mess up memory usage?


I'm currently trying to build a tool that needs to handle many strings.

When testing the tool, I observed using top that the memory usage of my program kept increasing over time. After a long analysis and the use of tools such as valgrind, I concluded that it is probably not due to memory leaks of my own.

At first I thought it could be due to memory fragmentation, but I investigated further and observed something I can't find any explanation for.


#include <string>
#include <iostream>
#include <unistd.h>  // sleep()
#include <array>

#define SIZE_LITTLE 100
#define SIZE_BIG 500000

struct Test {
  std::string data;
};

int main(){
  // First block: a few very large strings.
  {
    {
      std::string line = "very long string, i tested using a 15Mb string";
      std::array<struct Test*, SIZE_LITTLE> arr;
      for (int i = 0; i < SIZE_LITTLE; i++){
        arr[i] = new Test{line};
      }
      std::cout << "Allocated" << std::endl;
      sleep(5);
      for (int i = 0; i < SIZE_LITTLE; i++){
        delete arr[i];
      }
      std::cout << "Memory released" << std::endl;
      sleep(5);
    }
    std::cout << "Memory released out of scope" << std::endl;
    sleep(5);
  }

  // Second block: many small strings.
  {
    {
      std::string line = "smaller string";
      std::array<struct Test*, SIZE_BIG> arr;
      for (int i = 0; i < SIZE_BIG; i++){
        arr[i] = new Test{line};
      }
      std::cout << "Allocated" << std::endl;
      sleep(5);
      for (int i = 0; i < SIZE_BIG; i++){
        delete arr[i];
      }
      std::cout << "Memory released" << std::endl;
      sleep(5);
    }
    std::cout << "Memory released out of scope" << std::endl;
    sleep(5);
  }
}

When I run this code and look at the memory consumption, I cannot explain the behavior.

The first block of code seems to behave as expected. While sleeping after allocation, memory usage is high. After running the deletes, memory usage returns to near 0, and it stays there once the array goes out of scope.

The second block, on the other hand, has high memory usage after allocation, but it never goes down, neither after deleting all the pointers nor after going out of scope.

I also tried running the second block twice in a row. I expected higher memory consumption, but no: it stays the same from the beginning until the end.

What am I missing?


Solution

  • There are multiple levels of memory management. I'm guessing that you're on Windows, so let me explain it for Windows.

    The first layer is an abstraction of the hardware. Because of it, you can't control the exact physical memory address your program will access. You can imagine the chaos if every program competed for the same physical memory. Instead, any pointer you have refers to virtual memory, so it is valid only for your program and not for other programs. This has other benefits as well, e.g. you can access more memory than is available as physical RAM.

    Virtual memory is typically managed in chunks of 64 kB. This means that when you request 1 Byte from the Virtual Memory Manager, it will actually allocate 64 kB. Most of that is wasted, and it will look like a memory leak.
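
    The rounding described above can be sketched as plain arithmetic. This is only an illustration, assuming the 64 kB granularity mentioned here; the names kGranularity, reserved_bytes and wasted_bytes are made up for the example and are not part of any Windows API.

```cpp
#include <cstddef>

// Assumed chunk size of the virtual memory manager (64 kB, as in the text).
constexpr std::size_t kGranularity = 64 * 1024;

// Bytes that would actually be reserved for a request of `requested` bytes:
// the request rounded up to the next multiple of kGranularity.
constexpr std::size_t reserved_bytes(std::size_t requested) {
    if (requested == 0) return 0;
    return ((requested + kGranularity - 1) / kGranularity) * kGranularity;
}

// Bytes that are reserved but were never asked for.
constexpr std::size_t wasted_bytes(std::size_t requested) {
    return reserved_bytes(requested) - requested;
}
```

    With these assumptions, a 1 Byte request reserves 65536 Bytes and wastes 65535 of them.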

    To account for that waste, there's the Heap Manager as the second layer. The Heap Manager knows about that waste of virtual memory and deals with it. So when you request 1 Byte of memory, the Heap Manager will ask the Virtual Memory Manager for 64 kB and give you a part of that. When you request another 1 Byte later, it will give you another piece of what the Heap Manager already has. Thus, the waste is reduced. Only when the Heap Manager can't find free memory in the virtual memory it already owns will it request new virtual memory.

    Now, the thing is, if you give back one of the 2 Bytes, the Heap Manager can't give back the 64 kB to the Virtual Memory Manager yet, because you still need to be able to access that other byte, so it may still look like a leak.
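
    To make that concrete, here is a toy model of a heap manager that hands out pieces of 64 kB chunks and can only give a chunk back once nothing in it is live. It is a deliberately simplified sketch, not how the real Windows heap is implemented; ToyHeap and all of its members are hypothetical names, and requests larger than one chunk are not modeled.

```cpp
#include <cstddef>
#include <vector>

class ToyHeap {
public:
    static constexpr std::size_t kChunkSize = 64 * 1024;  // assumed chunk size

    // Serves a request (at most kChunkSize bytes) from an existing chunk if
    // one has room, otherwise "maps" a new chunk. Returns the chunk index.
    std::size_t allocate(std::size_t bytes) {
        for (std::size_t i = 0; i < chunks_.size(); ++i) {
            if (chunks_[i].mapped && kChunkSize - chunks_[i].used >= bytes) {
                chunks_[i].used += bytes;
                ++chunks_[i].live;
                return i;
            }
        }
        chunks_.push_back(Chunk{bytes, 1, true});
        return chunks_.size() - 1;
    }

    // Frees one allocation that lives in `chunk`. The chunk is handed back
    // ("unmapped") only when its last live allocation is gone.
    void release(std::size_t chunk) {
        Chunk& c = chunks_[chunk];
        if (--c.live == 0) {
            c.mapped = false;
            c.used = 0;
        }
    }

    // Virtual memory currently held by the heap, as a tool like top sees it.
    std::size_t mapped_bytes() const {
        std::size_t total = 0;
        for (const Chunk& c : chunks_) {
            if (c.mapped) total += kChunkSize;
        }
        return total;
    }

private:
    struct Chunk {
        std::size_t used;  // bytes handed out so far (bump allocation)
        std::size_t live;  // number of allocations not yet released
        bool mapped;       // still held from the virtual memory manager?
    };
    std::vector<Chunk> chunks_;
};
```

    In this model, allocating two single bytes uses one chunk, releasing the first byte leaves mapped_bytes() at 65536, and only releasing the second one brings it down to 0.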

    A third layer may exist inside your application. Say you allocate a buffer of 4096 Bytes and you fill only 350 of them. Then the remaining 3746 Bytes are "unused". Yet neither the Heap Manager nor the Virtual Memory Manager can know about this, so that memory will appear as "in use".
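
    In C++ this third layer shows up, for example, as the difference between a container's capacity and its size. The following sketch uses std::vector<char> as the buffer; the helper names unused_bytes and make_partly_filled_buffer are invented for the illustration.

```cpp
#include <cstddef>
#include <vector>

// Bytes the buffer owns but does not currently use.
std::size_t unused_bytes(const std::vector<char>& buf) {
    return buf.capacity() - buf.size();
}

// Builds a buffer with room for 4096 bytes of which only 350 are filled.
std::vector<char> make_partly_filled_buffer() {
    std::vector<char> buf;
    buf.reserve(4096);     // ask the heap for at least 4096 bytes
    buf.resize(350, 'x');  // actually use only 350 of them
    return buf;
}
```

    The heap only sees one allocation of at least 4096 Bytes; that 3746 or more of them are idle is known to the application alone.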

    Now, why does it behave differently for different sizes of strings?

    Well, a small string like "smaller string" needs only 14 Bytes, so it's a good idea to store many of those with the help of the Heap Manager. On the other hand, if 65534 Bytes are wasted on a 15 MB string, that's just 0.4% of waste, which might be acceptable.
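
    The percentages can be checked with a one-line helper. This is just the arithmetic from the paragraph above; waste_percent is a name made up for the example, and it expresses the waste relative to the size of the string itself.

```cpp
#include <cstddef>

// Wasted bytes expressed as a percentage of the bytes actually needed.
double waste_percent(std::size_t wasted_bytes, std::size_t payload_bytes) {
    return 100.0 * static_cast<double>(wasted_bytes)
                 / static_cast<double>(payload_bytes);
}
```

    For the 15 MB string, waste_percent(65534, 15 * 1024 * 1024) comes out at roughly 0.42. For a 14 Byte string that had a 64 kB chunk to itself, waste_percent(65536 - 14, 14) would be several hundred thousand percent, which is why small strings are packed together instead.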

    What the Heap Manager does: everything larger than 512 kB is forwarded directly to the Virtual Memory Manager. Anything smaller than 512 kB is managed by the Heap Manager itself.

    Therefore: when you free the 15 MB string, the Heap Manager can directly ask the Virtual Memory Manager to free everything, since no memory is shared with any other object. When you free a 14-Byte string, it can't do so, because the surrounding memory may still be needed for other strings.
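
    The decision can be sketched as a simple size check. The 512 kB threshold is the figure used in this answer and may differ on your system; Route, kDirectThreshold, route_for and returned_to_os_on_free are hypothetical names for the illustration.

```cpp
#include <cstddef>

// Who ends up serving an allocation in this simplified picture.
enum class Route { HeapManager, VirtualMemoryManager };

// Assumed threshold from the text: requests above 512 kB bypass the heap.
constexpr std::size_t kDirectThreshold = 512 * 1024;

constexpr Route route_for(std::size_t bytes) {
    return bytes > kDirectThreshold ? Route::VirtualMemoryManager
                                    : Route::HeapManager;
}

// A directly mapped block shares its memory with nothing else, so freeing
// it can hand the memory straight back. A heap-managed block cannot promise that.
constexpr bool returned_to_os_on_free(std::size_t bytes) {
    return route_for(bytes) == Route::VirtualMemoryManager;
}
```

    Under these assumptions the 15 MB strings of the first block are returned on delete, while the 14 Byte strings of the second block stay with the Heap Manager.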

    Also, the Heap Manager may "cache" that virtual memory for future use. In that case it won't give the memory back at all, but keeps it ready for future allocations of any kind.

    This description leaves out many details, but it should be enough for you to understand this problem and potential similar issues. Be aware that

    • things work a little differently in a debug build vs. a release build
    • things run a little differently under a debugger vs. without a debugger
    • things may be different on Linux
    • the sizes I mentioned here may be different, e.g. on Itanium architectures
    • heaps are managed in segments, segments are managed in blocks, etc.

    So yeah, it's a bit complicated. Just trust C++, the Heap Manager and the Virtual Memory Manager that they do the right thing. Usually they do.