Tags: c++, memory, vector, large-data, allocation

How to solve the issue of not enough memory in C++ when a vector's size is too large?


I want to test whether two vectors are equal. See the code below:

#include <iostream>
#include <vector>
#include <string>
int main(void) {
    std::vector<std::string> vstr1(131, "asdf");
    std::vector<std::string> vstr2(33131, "asdf");

    std::cout << (vstr1 == vstr2) << std::endl;
    std::cout << "******************************" << std::endl;

    return 0;
}

It works fine.

Now I change the size of vstr2 to something extremely large, like 33333333333131:

#include <iostream>
#include <vector>
#include <string>
int main(void) {
    std::vector<std::string> vstr1(131, "asdf");
    std::vector<std::string> vstr2(33333333333131, "asdf");

    std::cout << (vstr1 == vstr2) << std::endl;
    std::cout << "******************************" << std::endl;

    return 0;
}

This no longer works, and the error message is:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)  a.out

I gather that the error occurs due to a memory allocation failure.
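
For what it's worth, the crash can be avoided by catching the exception, which confirms that it is the constructor's allocation that fails; a minimal sketch:

#include <iostream>
#include <new>      // std::bad_alloc
#include <string>
#include <vector>

int main() {
    try {
        // Same oversized request as above (assumes a 64-bit build,
        // where the literal fits in std::size_t).
        std::vector<std::string> vstr2(33333333333131, "asdf");
        std::cout << vstr2.size() << std::endl;
    } catch (const std::bad_alloc& e) {
        // The vector constructor throws std::bad_alloc when the
        // underlying allocation fails.
        std::cerr << "allocation failed: " << e.what() << std::endl;
    }
    return 0;
}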

What can I do to deal with a vector of a really large size?


Solution

  • You've got two problems here:

    1) If each std::string required only 1 byte of RAM to store, your vector of 33333333333131 strings would require around 30 terabytes. In reality, each std::string requires a few dozen bytes (at least), so the real requirement is even larger than that (a back-of-the-envelope sketch follows these two points). You're unlikely to have that much RAM (or swap space) available on your computer.

    2) If your computer is running in 32-bit mode, you (typically) are limited to less than 2^32 bytes (aka 4 gigabytes) of address space for your process -- maybe a bit less, or maybe a few gigabytes more if your computer has some special paging tricks enabled. So in that case, even if you did have 30+TB of RAM, you wouldn't be able to access all of it directly.
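
    To make the arithmetic concrete, here is a back-of-the-envelope sketch; the exact value of sizeof(std::string) is implementation-defined (32 bytes is typical for 64-bit libstdc++), and this lower bound ignores any heap storage the strings themselves might need:

    #include <iostream>
    #include <string>

    int main() {
        const unsigned long long count = 33333333333131ULL;
        std::cout << "sizeof(std::string) = " << sizeof(std::string)
                  << " bytes" << std::endl;
        // Minimum footprint of the vector's element array alone.
        std::cout << "element array needs at least "
                  << count * sizeof(std::string) / (1024.0 * 1024 * 1024 * 1024)
                  << " TiB" << std::endl;
        return 0;
    }

    On a typical 64-bit implementation this prints roughly 970 TiB, which is how "a few dozen bytes per string" turns 30 terabytes into something far larger.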

    As for how to deal with very large data structures like your tera-vector, the usual threshold to consider is whether or not you expect to have enough physical RAM installed to fit all of the data into RAM. If you do, then great -- just use a vector (or whatever in-memory data structure you like) and have at it. If not, you might still get away with an in-memory data structure if you've configured enough swap space to hold the data, but since disk I/O is vastly slower than RAM, you might find that to be too slow for your purposes.
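
    If you want to sanity-check that first condition programmatically, here is a sketch assuming a Linux/glibc system (the _SC_PHYS_PAGES query is a common extension, not strict POSIX, so it may not exist everywhere):

    #include <iostream>
    #include <unistd.h>  // sysconf()

    int main() {
        long pages = sysconf(_SC_PHYS_PAGES);   // number of physical pages
        long page_size = sysconf(_SC_PAGESIZE); // bytes per page
        if (pages < 0 || page_size < 0) {
            std::cerr << "could not query physical memory" << std::endl;
            return 1;
        }
        double gib = static_cast<double>(pages) * page_size
                     / (1024.0 * 1024 * 1024);
        std::cout << "physical RAM: " << gib << " GiB" << std::endl;
        return 0;
    }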

    If you can't fit the data structure into RAM, then you have a few options:

    1. Split the data structure into manageable-sized smaller chunks, and process only one chunk at a time rather than trying to hold the entire thing in memory at once (see the streaming sketch after this list).

    2. Hold the data on disk instead of in RAM, and load just part of it at a time; operate on that part and write out the results (this is really just a variation of (1)).

    3. Split the task across multiple computers, and have each computer operate on just part of the task in parallel. Keep adding machines until there are enough to handle the job adequately.

    4. Step back and reconsider the problem you are trying to solve. Does it really require that much data to be stored? Why? Is there an alternative approach that would reduce the storage requirements? If you need to store terabytes of data, you probably either know exactly what you're doing (and therefore probably wouldn't be asking questions about it on Stack Overflow), or you're doing something grossly inefficient.

    5. (Last resort) Buy more storage hardware. You can get a computer with 30 terabytes of disk storage and half a terabyte of RAM for only about the cost of a new Ferrari these days, so knock yourself out! :)
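
    As a rough illustration of options (1) and (2), here is a sketch that streams a data set through a fixed-size buffer instead of holding all of it in memory; the file name huge_data.txt and the chunk size are made up for the example:

    #include <cstddef>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>

    int main() {
        std::ifstream in("huge_data.txt");  // hypothetical input file
        if (!in) {
            std::cerr << "could not open input" << std::endl;
            return 1;
        }

        const std::size_t chunk_size = 1000000;  // tune to available RAM
        std::vector<std::string> chunk;
        chunk.reserve(chunk_size);

        std::string line;
        std::size_t total = 0;
        while (std::getline(in, line)) {
            chunk.push_back(line);
            if (chunk.size() == chunk_size) {
                // Process this chunk here (sort it, aggregate it, write
                // partial results back to disk, ...), then discard it.
                total += chunk.size();
                chunk.clear();
            }
        }
        total += chunk.size();  // don't forget the final, partial chunk
        std::cout << "processed " << total << " lines in chunks of at most "
                  << chunk_size << std::endl;
        return 0;
    }

    The point is that peak memory use is bounded by chunk_size, not by the size of the data set.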