Efficiently reading a very large text file in C++


I have a very large text file (45 GB). Each line of the file contains two space-separated 64-bit unsigned integers, as shown below.

4624996948753406865 10214715013130414417

4305027007407867230 4569406367070518418

10817905656952544704 3697712211731468838 ... ...

I want to read the file and perform some operations on the numbers.

My Code in C++:

void process_data(string str)
{
    vector<string> arr;
    boost::split(arr, str, boost::is_any_of(" \n"));
    do_some_operation(arr);
}

int main()
{
    unsigned long long int read_bytes = 45 * 1024 *1024;
    const char* fname = "input.txt";
    ifstream fin(fname, ios::in);
    char* memblock;

    while(!fin.eof())
    {
        memblock = new char[read_bytes];
        fin.read(memblock, read_bytes);
        string str(memblock);
        process_data(str);
        delete [] memblock;
    }
    return 0;
}

I am relatively new to C++. When I run this code, I run into these problems:

  1. Because the file is read in fixed-size blocks, the last line of a block is sometimes cut off mid-line ("4624996948753406865 10214" instead of the actual string "4624996948753406865 10214715013130414417" in the file).

  2. This code runs very slowly. Processing one block takes around 6 seconds on a 64-bit Intel Core i7 920 system with 6 GB of RAM. Are there any optimization techniques I can use to improve the runtime?

  3. Is it necessary to include "\n" along with the space character in the boost::split call?

I have read about memory-mapped files (mmap) in C++, but I am not sure whether that is the right approach here. If it is, please share some links.


Solution

  • I'd redesign this to work as a stream instead of on fixed-size blocks.

    A simpler approach would be:

    std::ifstream ifs("input.txt");
    std::vector<uint64_t> parsed(std::istream_iterator<uint64_t>(ifs), {});
    

    If you know roughly how many values are expected, using std::vector::reserve up front could speed it up further.


    Alternatively, you can use a memory-mapped file and iterate over the character sequence.
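
    A rough sketch of what iterating over the character sequence could look like, written against a plain [first, last) pointer range so that it does not depend on how the file was mapped. parse_uint64s is a hypothetical helper, and it assumes well-formed input (decimal digits separated by spaces or newlines, with no overflow checks):

```cpp
#include <cstdint>
#include <vector>

// Parses every run of decimal digits in [first, last) as an unsigned 64-bit
// value. Any non-digit byte (space, '\n', '\r') is treated as a separator.
// Assumption: input is well formed; overflow and signs are not handled.
inline std::vector<std::uint64_t> parse_uint64s(const char* first, const char* last) {
    std::vector<std::uint64_t> out;
    while (first != last) {
        // Skip separators until the next digit.
        while (first != last && (*first < '0' || *first > '9')) ++first;
        if (first == last) break;

        // Accumulate one number digit by digit.
        std::uint64_t value = 0;
        while (first != last && *first >= '0' && *first <= '9') {
            value = value * 10 + static_cast<std::uint64_t>(*first - '0');
            ++first;
        }
        out.push_back(value);
    }
    return out;
}
```

    With boost::iostreams::mapped_file, the range would be const_data() and const_data() + size(). Since the whole file is one contiguous range, no line is ever split across block boundaries.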

    Update: I modified the above program to parse uint32_ts into a vector.

    With a sample input file of 4.5 GiB[1], the program runs in about 9 seconds[2]:

    sehe@desktop:/tmp$ make -B && sudo chrt -f 99 /usr/bin/time -f "%E elapsed, %c context switches" ./test smaller.txt
    g++ -std=c++0x -Wall -pedantic -g -O2 -march=native test.cpp -o test -lboost_system -lboost_iostreams -ltcmalloc
    parse success
    trailing unparsed: '
    '
    data.size():   402653184
    0:08.96 elapsed, 6 context switches
    

    Of course, it allocates at least 402653184 * 4 bytes = 1.5 GiB. So when you read a 45 GB file, you will need an estimated 15 GiB of RAM just to store the vector (assuming no fragmentation on reallocation). The 45 GiB parse completes in 10 min 45 s:

    make && sudo chrt -f 99 /usr/bin/time -f "%E elapsed, %c context switches" ./test 45gib_uint32s.txt 
    make: Nothing to be done for `all'.
    tcmalloc: large alloc 17570324480 bytes == 0x2cb6000 @  0x7ffe6b81dd9c 0x7ffe6b83dae9 0x401320 0x7ffe6af4cec5 0x40176f (nil)
    Parse success
    Trailing unparsed: 1 characters
    Data.size():   4026531840
    Time taken by parsing: 644.64s
    10:45.96 elapsed, 42 context switches
    

    By comparison, just running wc -l 45gib_uint32s.txt took ~12 minutes (without realtime priority scheduling, though), and wc is blazingly fast.

    Full Code Used For Benchmark

    #include <boost/spirit/include/qi.hpp>
    #include <boost/iostreams/device/mapped_file.hpp>
    #include <chrono>
    #include <cstdint>
    #include <iostream>
    #include <iterator>
    #include <string>
    #include <vector>
    
    namespace qi = boost::spirit::qi;
    
    typedef std::vector<uint32_t> data_t;
    
    using hrclock = std::chrono::high_resolution_clock;
    
    int main(int argc, char** argv) {
        if (argc<2) return 255;
        data_t data;
        data.reserve(4392580288);   // for the  45 GiB file benchmark
        // data.reserve(402653284); // for the 4.5 GiB file benchmark
    
        boost::iostreams::mapped_file mmap(argv[1], boost::iostreams::mapped_file::readonly);
        auto f = mmap.const_data();
        auto l = f + mmap.size();
    
        using namespace qi;
    
        auto start_parse = hrclock::now();
        bool ok = phrase_parse(f,l,int_parser<uint32_t, 10>() % eol, blank, data);
        auto stop_time = hrclock::now();
    
        if (ok)   
            std::cout << "Parse success\n";
        else 
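            // Show at most 50 characters of context without reading past the end of the mapping
            std::cerr << "Parse failed at #" << std::distance(mmap.const_data(), f) << " around '" << std::string(f, (l - f > 50) ? f + 50 : l) << "'\n";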
    
        if (f!=l) 
            std::cerr << "Trailing unparsed: " << std::distance(f,l) << " characters\n";
    
        std::cout << "Data.size():   " << data.size() << "\n";
        std::cout << "Time taken by parsing: " << std::chrono::duration_cast<std::chrono::milliseconds>(stop_time-start_parse).count() / 1000.0 << "s\n";
    }
    

    [1] generated with od -t u4 /dev/urandom -A none -v -w4 | pv | dd bs=1M count=$((9*1024/2)) iflag=fullblock > smaller.txt

    [2] Obviously, this was with the file cached in the Linux buffer cache; the large file doesn't have this benefit.