Tags: c++, optimization, large-data, large-files, file-read

Accelerate reading/processing log files in C++


I am trying to read large log files (around 10 GB) with the following format:

3 49
7 28 
...

Basically there are two columns, and each column is a positive integer.

And here is the code I am using to read the log files into a two-level map:

typedef std::map<int, std::map<int, std::tuple<int, int, int>>> coverage_info;

coverage_info cmap;
...

      fp = fopen("log.txt", "r");

      while (fgets(line, len, fp)) {
        sscanf(line, "%d %d", &func_id, &bb_id);

        auto it = cmap.find(func_id);
        if (it != cmap.end()) {
          auto it1 = (it->second).find(bb_id);
          if (it1 != (it->second).end()) {
            auto c = std::get<0>(cmap[func_id][bb_id]);
            cmap[func_id][bb_id] = std::make_tuple(++c, 0, 0);
          }
          else 
            cmap[func_id][bb_id] = std::make_tuple(1, 0, 0);
        }
        else {
          std::unordered_map<int, std::tuple<int, int, int>> m;
          m[bb_id] = std::make_tuple(1, 0, 0);
          cmap[func_id] = m;
        }
      }

When reading a 7 GB log file, the above code takes about 30 minutes to finish, which is too slow for me.

I have thought about the following ways of optimization:

  1. use unordered_map instead of map to speed up lookups from O(log n) to O(1). This reduces the processing time from 30 minutes to about 15 minutes in my case.

  2. maybe sort the log file first? But since unordered_map already works well for me, sorting the log file does not seem to make much sense right now.

  3. try to read a large buffer (e.g., 10 MB) at a time instead of one line per fgets call (see the sketch after this list)? Maybe I could use a std::ifstream or something with a large buffer?

  4. use mmap? But I would like to stay platform-independent, and mmap is not.

  5. flatten the coverage_info map design by combining the first and second columns into a single key instead of a two-level map. But then again, how would this speed up a query that is already O(1)?
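
For idea 3, what I have in mind is roughly the following (an untested sketch: it assumes every line holds exactly two non-negative integers, that lines are short, and that the file ends with a newline; the file name and chunk size are placeholders):

      #include <cstdio>
      #include <cstring>
      #include <unordered_map>

      int main() {
          std::unordered_map<int, std::unordered_map<int, int>> cmap;

          FILE *fp = fopen("log.txt", "r");
          if (!fp) return 1;

          const std::size_t BUF = 10 * 1024 * 1024;  // 10 MB chunk
          char *buf = new char[BUF];
          std::size_t carry = 0;                     // bytes of an incomplete trailing line

          std::size_t n;
          while ((n = fread(buf + carry, 1, BUF - carry, fp)) > 0) {
              std::size_t total = carry + n;
              std::size_t end = total;               // position just past the last '\n'
              while (end > 0 && buf[end - 1] != '\n') --end;

              char *p = buf, *stop = buf + end;
              while (p < stop) {                     // parse complete lines only
                  int func_id = 0, bb_id = 0;
                  while (p < stop && (*p < '0' || *p > '9')) ++p;
                  while (p < stop && *p >= '0' && *p <= '9') func_id = func_id * 10 + (*p++ - '0');
                  while (p < stop && (*p == ' ' || *p == '\t')) ++p;
                  while (p < stop && *p >= '0' && *p <= '9') bb_id = bb_id * 10 + (*p++ - '0');
                  while (p < stop && *p != '\n') ++p;
                  if (p < stop) ++p;                 // step past the newline
                  ++cmap[func_id][bb_id];
              }
              carry = total - end;                   // keep the incomplete tail
              memmove(buf, buf + end, carry);
          }
          delete[] buf;
          fclose(fp);
      }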

What else? Could anyone shed some light on potential optimization options? Any suggestion would be appreciated.


Solution

    1. First you should check which part of the code takes the most time. Probably it is either the fgets/sscanf pair or the map handling. You should do this research before asking the question here. How? Remove everything except sscanf from the loop and see how long it takes now (you may want to read a smaller file, say 100 MB, for the test). Or use a profiling tool. Then concentrate on the part that takes almost all of the time.
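
    A quick way to time that without a profiler (a minimal sketch; the test file name is a placeholder):

        #include <chrono>
        #include <cstdio>

        int main() {
            char line[256];
            int func_id, bb_id;
            FILE *fp = fopen("log_100M.txt", "r");  // a smaller test file
            if (!fp) return 1;

            auto t0 = std::chrono::steady_clock::now();
            while (fgets(line, sizeof line, fp))
                sscanf(line, "%d %d", &func_id, &bb_id);  // parsing only, no map work
            auto t1 = std::chrono::steady_clock::now();
            fclose(fp);

            std::printf("parse only: %lld ms\n", (long long)
                std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count());
        }

    Compare this against the full loop; the difference is the map overhead.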

    2. Why do you read the line and then parse it? Why not fscanf? Your loop would start like this:

        while (true) {
            int rc = fscanf(fp, "%d %d", &func_id, &bb_id);
            if (rc != 2)
                break;
            ...
        }

    3. Yes, try using a std::ifstream instead of FILE*.
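
    For example (a sketch; note that pubsetbuf must be called before the file is opened, and whether it actually enlarges the buffer is implementation-defined):

        #include <fstream>
        #include <map>

        int main() {
            static char buf[1 << 20];               // try a 1 MB stream buffer
            std::ifstream in;
            in.rdbuf()->pubsetbuf(buf, sizeof buf); // effect is implementation-defined
            in.open("log.txt");

            std::map<int, std::map<int, int>> cmap;
            int func_id, bb_id;
            while (in >> func_id >> bb_id)          // reads whitespace-separated ints
                ++cmap[func_id][bb_id];
        }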

    4. You have a map of maps of tuples whose second and third elements are always 0. Why tuples, then? You should replace them with plain ints (std::map<int, std::map<int, int>>).

    4a. After you use ints, you can simplify your whole loop to

    while (fgets(line, len, fp)) {
        sscanf(line, "%d %d", &func_id, &bb_id);
        ++cmap[func_id][bb_id];
    }
    

    or, with fscanf,

    while (true) {
        int rc = fscanf(fp, "%d %d", &func_id, &bb_id);
        if (rc != 2)
            break;
        ++cmap[func_id][bb_id];
    }
    

    It is because std::map::operator[] creates the element if it doesn't find one, and initializes it. Numbers are initialized to 0.

    5. You have a map of maps, but here:

         std::unordered_map<int, std::tuple<int, int, int>> m;
         m[bb_id] = std::make_tuple(1, 0, 0);
         cmap[func_id] = m;
      

    you assign an unordered_map. With the typedef as posted this should not even compile, because there is no conversion from unordered_map to map; and even with matching types, building a temporary container for every new func_id is wasteful.
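
    If you keep the tuple design, the temporary is unnecessary anyway: operator[] default-constructs the missing inner map, so the whole else branch collapses to one line:

        cmap[func_id][bb_id] = std::make_tuple(1, 0, 0);  // inner map is created on first access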

    6. This code

           auto c = std::get<0>(cmap[func_id][bb_id]);
           cmap[func_id][bb_id] = std::make_tuple(++c, 0, 0);
      

    seems pretty inefficient (it does several map lookups where one would do) and should be simplified to ++std::get<0>(cmap[func_id][bb_id]).

    7. Do you compile with optimisations? (-O3 if you use g++ on Linux.)

    8. Check whether unordered_map is actually better than map here. Try all four variants: map of maps, map of unordered_maps, unordered_map of maps, and unordered_map of unordered_maps.
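
    Using the tuple-free value type from point 4, the four variants to benchmark would look like this:

        #include <map>
        #include <unordered_map>

        using map_map   = std::map<int, std::map<int, int>>;
        using map_umap  = std::map<int, std::unordered_map<int, int>>;
        using umap_map  = std::unordered_map<int, std::map<int, int>>;
        using umap_umap = std::unordered_map<int, std::unordered_map<int, int>>;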

    9. If both numbers are 32-bit ints, consider packing them into one 64-bit integer: z = ((std::uint64_t) x << 32) + y. If you know that 0 <= x < 65536 and 0 <= y < 65536, you can use (x << 16) + y instead; note the parentheses, because x << 16 + y parses as x << (16 + y).
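
    A sketch of the packed-key variant (using std::uint64_t rather than unsigned long, since the width of unsigned long is platform-dependent):

        #include <cstdint>
        #include <unordered_map>

        int main() {
            std::unordered_map<std::uint64_t, int> cmap;
            int func_id = 3, bb_id = 49;            // would come from the parser

            // pack two non-negative 32-bit ids into one 64-bit key;
            // mind the parentheses: x << 16 + y parses as x << (16 + y)
            std::uint64_t key = (static_cast<std::uint64_t>(func_id) << 32)
                              | static_cast<std::uint32_t>(bb_id);
            ++cmap[key];
        }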

    10. After sorting the file numerically (sort -n), the processing might run faster because it would use the cache better. But the sorting itself would take time.