I'm implementing a parallel map/reduce project. However, with an input file of roughly 1GB, for a word-count toy example with only one mapper (which maps the whole file), I receive a std::bad_alloc exception. Unfortunately this happens only on the remote Xeon Phi (which has less RAM), so I can't debug it in depth.
However, memory is occupied in two places. The first is when the mapper reads (stores) the whole file into a char*:
void getNextKeyValue() {
    key = pos;                        // int
    value = new char[file_size];      // with only 1 mapper, file_size is the whole file
    ssize_t result = pread(fd, value, file_size, pos);
    assert(result == file_size);
    morePairs = false;
}
The second is when the map function is called and a series of pair&lt;char*,int&gt; is stored inside a vector as the map's result. The map function:
std::function<void(int key, char *value, MapResult<int,char*,char*,int> *result)> map_func =
    [](int key, char *value, MapResult<int,char*,char*,int> *result) {
        const char delimit[] = " \t\r\n\v\f";
        char *token, *save;
        token = strtok_r(value, delimit, &save);
        while (token != NULL) {
            result->emit(token, 1);
            token = strtok_r(NULL, delimit, &save);
        }
    };
The emit implementation (and the map result's generation):

void emit(char *key, int value) {
    res.push_back(pair<char*,int>(key, value));
}
...
private:
    vector<pair<char*,int>> res;
Note: normally key and value in emit are template parameters, but I omit them for clarity in this example.
At first I thought std::bad_alloc was thrown because of char *value (which takes 1GB), but the exception is thrown after a test cout message placed after the value allocation, so that's not the problem. From what I've read about the strtok implementation, the original char* is modified in place (a \0 is written at the end of each token, replacing the delimiter), so no additional memory is allocated.
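That in-place behavior is easy to verify with a tiny self-contained check (the function name is mine): the returned tokens point into the original buffer, and the delimiter that ended each token has been overwritten with '\0'.

```cpp
#include <cstring>

// Demo: strtok_r tokenizes in place. Each returned token is a pointer
// into `buf` itself, and the delimiter after each token becomes '\0'.
// No extra memory is allocated for the tokens.
bool strtokModifiesBufferInPlace() {
    char buf[] = "one two\tthree";
    const char delimit[] = " \t\r\n\v\f";
    char *save = nullptr;
    char *t1 = strtok_r(buf, delimit, &save);      // "one"
    char *t2 = strtok_r(nullptr, delimit, &save);  // "two"
    char *t3 = strtok_r(nullptr, delimit, &save);  // "three"
    return t1 == buf                // token points into the original buffer
        && buf[3] == '\0'           // the space after "one" was overwritten
        && strcmp(t2, "two") == 0
        && strcmp(t3, "three") == 0;
}
```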
The only remaining possibility is the space occupied by the vector&lt;pair&lt;char*,int&gt;&gt;, but I'm not able to figure out its size (please help me with this). Assuming an average word length of 5 characters, we should have ~2*10^8 words.
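A back-of-envelope check of that vector's footprint (the helper name vectorPayloadBytes is mine): on a typical 64-bit ABI, sizeof(pair&lt;char*,int&gt;) is 16 bytes (an 8-byte pointer plus a 4-byte int padded to 8), so ~2*10^8 pairs cost ~3.2GB of payload alone, before counting the transient peak of a reallocation, when the old and new blocks coexist.

```cpp
#include <cstddef>
#include <utility>

// Payload of the result vector for `words` entries. With 16-byte pairs,
// 200 million words come to ~3.2GB, and a 2x capacity growth needs the
// old half-size block alive as well while copying (~4.8GB transiently).
std::size_t vectorPayloadBytes(std::size_t words) {
    return words * sizeof(std::pair<char*, int>);
}
```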
UPDATE AFTER 1201ProgramAlarm's answer:
Unfortunately, precomputing the number of words and then calling reserve() to eliminate the vector's unused memory isn't feasible: just counting the words of a 280MB file takes 1242ms out of a 1329ms total execution time (~5000ms when the file is read for the first time).

The problem isn't the space used by the vector; it's all the space previously used by the vector while its capacity was smaller. Unless you call reserve on the vector, it starts empty and allocates a small amount of space (typically large enough for one element) when you push the first element. During later pushes, if it doesn't have enough remaining capacity, it allocates a larger block (1.5x or 2x the current size), copies the elements over, and frees the old one. This means you need enough free memory for both the smaller block and the larger one at the same time. Because the freed chunks, even when combined, still won't be enough for the next, larger request, a lot of memory can end up free but unusable.
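That growth behavior can be observed directly by watching capacity() change during push_back. A small sketch (the helper name countReallocations is mine):

```cpp
#include <cstddef>
#include <vector>

// Count how many reallocations a vector performs while push_back'ing n
// elements without reserve(): each time capacity() changes, a larger
// block was allocated and the old one freed (both briefly coexist).
std::size_t countReallocations(std::size_t n) {
    std::vector<int> v;
    std::size_t reallocs = 0;
    std::size_t cap = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(0);
        if (v.capacity() != cap) {
            ++reallocs;
            cap = v.capacity();
        }
    }
    return reallocs;
}
```

With a geometric growth factor of 1.5x or 2x, a million elements trigger only a few dozen reallocations, but each one momentarily needs old plus new storage; with reserve(n) up front there would be at most one allocation.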
You should call res.reserve(/*appropriate large size*/), or switch containers to a deque, which, while it will need more space in the end, won't need to reallocate as it grows. To get the size to reserve, you can walk your file once to see how many words it really contains, reserve space for them, then walk it again and save the words.