Tags: c++, csv, io, fixed-width

Is it possible to efficiently get a subset of rows from a large fixed-width CSV file?


I have an extremely large fixed-width CSV file (1.3 million rows and 80K columns). It's about 230 GB in size, and I need to fetch a subset of its rows. I have a vector of the row indices I need; now I have to figure out how to traverse such a massive file to get them.

The way I understand it, C++ will read the file line by line until it hits a newline (or a given delimiter), at which point it clears the buffer and moves on to the next line. I have also heard of a seekg() function that can jump to a given position in a stream. So is it possible to use this function somehow to move the file pointer to the correct line quickly?

I figured that since the program wouldn't have to run billions of comparisons checking for newlines, it might be faster if I simply told it where to go in the fixed-width file. But I have no idea how to do that.

Let's say that my file is n characters wide and my line numbers are {l_1, l_2, l_3, ..., l_m} (where l_1 < l_2 < ... < l_m). In that case, I can simply tell the file pointer to go to (l_1 - 1) * n, right? But then for the next line, do I calculate the jump from the end of line l_1 or from the beginning of the next line? And should I include the newlines when calculating the jumps?

Will this even help improve speed, or am I just misunderstanding something here?

Thanks for taking the time to help

EDIT: The file will look like this:

id0000001,AB,AB,AA,--,BB
id0000002,AA,--,AB,--,BB
id0000003,AA,AA,--,--,BB
id0000004,AB,AB,AA,AB,BB

Solution

  • As I proposed in the comment, you can compress each data field down to two bits:

    -- 00
    AA 01
    AB 10
    BB 11
    

    That cuts your file size by a factor of 12 (each field shrinks from 3 bytes, i.e. two characters plus a separator, to 2 bits), so it'll be ~20 GB. Considering that your processing is likely I/O-bound, you may speed up processing by the same factor of 12.

    The resulting file will have a record length of 20,000 bytes (80,000 fields at 2 bits each; the id column is dropped, since a record's position already encodes its index), so it will be easy to calculate the offset of any given record. No newline characters to consider :)
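    For instance, fetching an arbitrary subset of rows from such a file is one absolute seek per record. A sketch with a toy record length (the real file would use 20,000):

    ```cpp
    #include <cassert>
    #include <fstream>
    #include <string>

    int main()
    {
        // Toy stand-in for the packed file: 5 fixed-length records,
        // no newlines.
        const std::streamoff record_len = 4;
        {
            std::ofstream bin("toy.bin", std::ios::binary);
            for (char c : {'a', 'b', 'c', 'd', 'e'})
                bin.write(std::string(record_len, c).data(), record_len);
        }

        std::ifstream bin("toy.bin", std::ios::binary);
        std::string rec(record_len, '\0');
        for (long long row : {4LL, 1LL, 3LL}) {         // 0-based rows, any order
            bin.seekg(row * record_len, std::ios::beg); // one absolute seek
            bin.read(&rec[0], record_len);
            assert(rec[0] == 'a' + row);
        }
        return 0;
    }
    ```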

    Here is how I build that binary file:

    #include <fstream>
    #include <iostream>
    #include <string>
    #include <chrono>
    #include <cstdint> // for uint64_t
    
    int main()
    {
        auto t1 = std::chrono::high_resolution_clock::now();
        std::ifstream src("data.txt", std::ios::binary);
        std::ofstream bin("data.bin", std::ios::binary);
        size_t length = 80'000 * 3 + 9 + 2; // 9-char id + 80,000 fields of 3 chars each; the `2` is the CR/LF length on my Windows; use `1` for other systems
        std::string str(length, '\0');
        while (src.read(&str[0], length))
        {
            size_t pos = str.find(',') + 1;
            for (int group = 0; group < 2500; ++group) {
                uint64_t compressed(0), field(0);
                for (int i = 0; i < 32; ++i, pos += 3) {
                    if (str[pos] == '-')
                        field = 0;
                    else if (str[pos] == 'B')
                        field = 3;
                    else if (str[pos + 1] == 'B')
                        field = 2;
                    else
                        field = 1;
    
                    compressed <<= 2;
                    compressed |= field;
                }
                bin.write(reinterpret_cast<char*>(&compressed), sizeof compressed);
            }
        }
        auto t2 = std::chrono::high_resolution_clock::now();
        std::cout << std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count() << std::endl;
    
        // clear `bad` bit set by trying to read past EOF
        src.clear();
        // rewind to the first record
        src.seekg(0);
        src.read(&str[0], length);
        // read next (second) record
        src.read(&str[0], length);
        // read the forty-second record from the start (skip 41)
        src.seekg(41 * length, std::ios_base::beg);
        src.read(&str[0], length);
        // read the next (forty-third) record
        src.read(&str[0], length);
        // read the fiftieth record (skip 6 from the current position)
        src.seekg(6 * length, std::ios_base::cur);
        src.read(&str[0], length);
    
        return 0;
    }
    

    This encodes about 1,600 records per second, so converting the whole file will take ~15 minutes. How long does your current processing take?

    UPDATE:

    Added an example of how to read individual records from src.

    I only managed to make seekg() work in binary mode.
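    Unpacking a single field back out of the binary file is also pure offset arithmetic: 0-based field col of a record lives in 64-bit word col / 32 of that record, with the first field of each 32-field group in the most significant bits, matching the encoder's shift order (the words are written in host byte order, so read them back the same way on the same machine). A sketch, exercised against a toy two-record file:

    ```cpp
    #include <cassert>
    #include <cstdint>
    #include <fstream>

    // Fetch one 2-bit field from the packed file (0-based row and col).
    uint64_t read_field(std::ifstream& bin, std::streamoff record_len,
                        std::streamoff row, std::streamoff col)
    {
        uint64_t word;
        bin.seekg(row * record_len + (col / 32) * sizeof word, std::ios::beg);
        bin.read(reinterpret_cast<char*>(&word), sizeof word);
        int shift = 62 - 2 * static_cast<int>(col % 32); // field 0 in top bits
        return (word >> shift) & 3;   // 00 "--", 01 "AA", 10 "AB", 11 "BB"
    }

    int main()
    {
        // Toy file: 2 records of one 8-byte word (32 fields) each;
        // field 5 holds a value, every other field is "--" (00).
        const std::streamoff record_len = 8;
        {
            std::ofstream bin("toy.bin", std::ios::binary);
            for (uint64_t base : {uint64_t(1), uint64_t(3)}) {
                uint64_t packed = 0;
                for (int i = 0; i < 32; ++i) {
                    packed <<= 2;
                    packed |= (i == 5) ? base : 0;
                }
                bin.write(reinterpret_cast<char*>(&packed), sizeof packed);
            }
        }
        std::ifstream bin("toy.bin", std::ios::binary);
        assert(read_field(bin, record_len, 0, 5) == 1);  // "AA" in record 0
        assert(read_field(bin, record_len, 1, 5) == 3);  // "BB" in record 1
        assert(read_field(bin, record_len, 0, 6) == 0);  // "--"
        return 0;
    }
    ```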