I'm implementing a parallel map/reduce project. However, with an input file of roughly 1GB, for a word-count toy example with only one mapper (which maps the whole file), I receive a std::bad_alloc exception. Unfortunately this happens only on the remote Xeon Phi (which has less RAM), so I can't debug it in depth.
However, memory is occupied in two places. The first is when the mapper reads (stores) the whole file into a char*:
void getNextKeyValue() {
    key = pos;                        // int
    value = new char[file_size];      // with only 1 mapper, file_size is the whole file
    ssize_t result = pread(fd, value, file_size, pos);
    assert(result == file_size);
    morePairs = false;
}
The second is when the map function is called and a series of pair&lt;char*,int&gt; is stored inside a vector as the map's result. The map function:
std::function<void(int key, char *value, MapResult<int,char*,char*,int> *result)> map_func =
    [](int key, char *value, MapResult<int,char*,char*,int> *result) {
        const char delimit[] = " \t\r\n\v\f";
        char *token, *save;
        token = strtok_r(value, delimit, &save);
        while (token != NULL) {
            result->emit(token, 1);
            token = strtok_r(NULL, delimit, &save);
        }
    };
The emit implementation (and the map result's generation):

void emit(char *key, int value) {
    res.push_back(pair<char*,int>(key, value));
}
...
private:
    vector<pair<char*,int>> res;
Note: normally key and value in emit are template parameters, but I omit them for clarity in this example.
At first I thought std::bad_alloc was thrown because of char *value (which takes 1GB), but the exception is thrown after a test cout message placed after the value allocation, so that's not the problem. From what I've read about the strtok implementation, the original char* is modified in place (a \0 is written at the end of each token, replacing the delimiter), so no additional memory is allocated.
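That in-place behavior is easy to verify with a tiny self-contained check (the function name is mine): the returned tokens point into the original buffer, and the delimiter that ended each token has been overwritten with '\0'.

```cpp
#include <cstring>

// Demo: strtok_r tokenizes in place. Each returned token is a pointer
// into `buf` itself, and the delimiter after each token becomes '\0'.
// No extra memory is allocated for the tokens.
bool strtokModifiesBufferInPlace() {
    char buf[] = "one two\tthree";
    const char delimit[] = " \t\r\n\v\f";
    char *save = nullptr;
    char *t1 = strtok_r(buf, delimit, &save);      // "one"
    char *t2 = strtok_r(nullptr, delimit, &save);  // "two"
    char *t3 = strtok_r(nullptr, delimit, &save);  // "three"
    return t1 == buf                // token points into the original buffer
        && buf[3] == '\0'           // the space after "one" was overwritten
        && strcmp(t2, "two") == 0
        && strcmp(t3, "three") == 0;
}
```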
The only remaining possibility is the space occupied by the vector&lt;pair&lt;char*,int&gt;&gt;, but I'm not able to figure out its size (please help me with this). Assuming an average word length of 5 characters, we should have ~2*10^8 words.
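A back-of-envelope check of that vector's footprint (the helper name vectorPayloadBytes is mine): on a typical 64-bit ABI, sizeof(pair&lt;char*,int&gt;) is 16 bytes (an 8-byte pointer plus a 4-byte int padded to 8), so ~2*10^8 pairs cost ~3.2GB of payload alone, before counting the transient peak of a reallocation, when the old and new blocks coexist.

```cpp
#include <cstddef>
#include <utility>

// Payload of the result vector for `words` entries. With 16-byte pairs,
// 200 million words come to ~3.2GB, and a 2x capacity growth needs the
// old half-size block alive as well while copying (~4.8GB transiently).
std::size_t vectorPayloadBytes(std::size_t words) {
    return words * sizeof(std::pair<char*, int>);
}
```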
UPDATE AFTER 1201ProgramAlarm's answer:
Unfortunately, precomputing the number of words and then calling reserve() to eliminate the vector's unused memory isn't feasible: just counting the words of a 280MB file takes 1242ms out of a 1329ms total execution time (~5000ms when the file is read for the first time).

The problem isn't the space used by the vector; it's all the space previously used by the vector while its capacity was smaller. Unless you call reserve on the vector, it starts empty and allocates a small amount of space (typically large enough for one element) when you push the first element. During later pushes, if it doesn't have enough remaining capacity, it allocates a larger block (1.5x or 2x the current size), copies the elements over, and frees the old one. This means you need enough free memory for both the smaller block and the larger one at the same time. Because the freed chunks, even when combined, still won't be enough for the next, larger request, a lot of memory can end up free but unusable.
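That growth behavior can be observed directly by watching capacity() change during push_back. A small sketch (the helper name countReallocations is mine):

```cpp
#include <cstddef>
#include <vector>

// Count how many reallocations a vector performs while push_back'ing n
// elements without reserve(): each time capacity() changes, a larger
// block was allocated and the old one freed (both briefly coexist).
std::size_t countReallocations(std::size_t n) {
    std::vector<int> v;
    std::size_t reallocs = 0;
    std::size_t cap = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(0);
        if (v.capacity() != cap) {
            ++reallocs;
            cap = v.capacity();
        }
    }
    return reallocs;
}
```

With a geometric growth factor of 1.5x or 2x, a million elements trigger only a few dozen reallocations, but each one momentarily needs old plus new storage; with reserve(n) up front there would be at most one allocation.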
You should call res.reserve(/*appropriate large size*/), or switch containers to a deque, which, while it will need more space in the end, won't need to reallocate as it grows. To get the size to reserve, you can walk your file once to see how many words it really contains, reserve space for them, then walk it again and save the words.