Tags: c++, performance, stl, buffering

Read a huge text file line by line in C++ with buffering


I need to read a huge 35 GB file from disk line by line in C++. Currently I do it the following way:

ifstream infile("myfile.txt");
string line;
while (true) {
    if (!getline(infile, line)) break;
    long linepos = infile.tellg();
    process(line,linepos);
}

But this gives me only about 2 MB/s, although a file manager copies the same file at 100 MB/s. I guess that getline() is not buffering correctly. Please propose some sort of buffered line-by-line reading approach.

UPD: process() is not the bottleneck; the code runs at the same speed without process().


Solution

  • I translated my own buffering code from a Java project of mine, and it does what I need. I had to add #ifdefs to work around a problem with tellg in the MSVC 2010 compiler, which always returns wrong negative values on huge files. This algorithm gives the desired speed of ~100 MB/s, though it does some useless new[] allocations.

    void readFileFast(ifstream &file, void(*lineHandler)(char*str, int length, __int64 absPos)){
            int BUF_SIZE = 40000;
            file.seekg(0,ios::end);
            ifstream::pos_type p = file.tellg();
    #ifdef WIN32
            __int64 fileSize = *(__int64*)(((char*)&p) +8);
    #else
            __int64 fileSize = p;
    #endif
            file.seekg(0,ios::beg);
            BUF_SIZE = (int)min((__int64)BUF_SIZE, fileSize); // same type on both sides so std::min also compiles
            char* buf = new char[BUF_SIZE];
            int bufLength = BUF_SIZE;
            file.read(buf, bufLength);
    
            int strEnd = -1;
            int strStart;
            __int64 bufPosInFile = 0;
            while (bufLength > 0) {
                int i = strEnd + 1;
                strStart = strEnd;
                strEnd = -1;
                for (; i < bufLength && i + bufPosInFile < fileSize; i++) {
                    if (buf[i] == '\n') {
                        strEnd = i;
                        break;
                    }
                }
    
                if (strEnd == -1) { // scroll buffer
                    if (strStart == -1) {
                        lineHandler(buf + strStart + 1, bufLength, bufPosInFile + strStart + 1);
                        bufPosInFile += bufLength;
                        bufLength = (int)min((__int64)bufLength, fileSize - bufPosInFile);
                        delete[]buf;
                        buf = new char[bufLength];
                        file.read(buf, bufLength);
                    } else {
                        int movedLength = bufLength - strStart - 1;
                        memmove(buf,buf+strStart+1,movedLength);
                        bufPosInFile += strStart + 1;
                        int readSize = (int)min((__int64)(bufLength - movedLength), fileSize - bufPosInFile - movedLength);
    
                        if (readSize != 0)
                            file.read(buf + movedLength, readSize);
                        if (movedLength + readSize < bufLength) {
                            char *tmpbuf = new char[movedLength + readSize];
                            memmove(tmpbuf,buf,movedLength+readSize);
                            delete[]buf;
                            buf = tmpbuf;
                            bufLength = movedLength + readSize;
                        }
                        strEnd = -1;
                    }
                } else {
                    lineHandler(buf+ strStart + 1, strEnd - strStart, bufPosInFile + strStart + 1);
                }
            }
            lineHandler(0, 0, 0);//eof
    }
    
    void lineHandler(char*buf, int l, __int64 pos){
        if(buf==0) return;
        string s = string(buf, l);
        printf("%s", s.c_str()); // never pass file data as the format string
    }
    
    void loadFile(){
        ifstream infile("file", ios::binary); // binary mode keeps read() sizes and offsets equal to raw bytes
        readFileFast(infile,lineHandler);
    }