c++ windows file stat

fread is reading the last part of my text file twice


I have a text file I want to read into a std::vector. It's okay if the vector is a little too big, but something very strange happens: the entire file is copied, and then a portion from near the end of the file shows up a second time, appended after it. (I think this might simply be garbage, but I don't know.)

So if the file looked like this (it's a pretty large txt file):

0kb        100kb       200kb      300kb   
 v          v           v           v
[1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ]

The copy in memory looks like this:

0kb        100kb       200kb      300kb   302kb
 v          v           v           v     v
[1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ TUVW]
                              ^^^^ this section is repeated at the end

I'm not entirely sure what's causing it. The code I wrote to perform this copy works in three steps:

  • I first use stat to obtain a size, in bytes, that can hold the file. This might end up being larger than needed because of how Windows handles line endings.
  • I allocate my memory.
  • Using fread() I copy the file into the vector in one shot.

void copyFile(std::vector<char> & output, const char * filename) {
    output.clear();
    FILE * file = fopen(filename, "r");
    if (!file)
        return;
    {
        struct stat statBuffer;
        stat(filename, &statBuffer);
        output.resize(statBuffer.st_size + 1);
        fread(output.data(), 1, statBuffer.st_size, file);
        output[statBuffer.st_size] = 0;  // make sure it's null terminated
    }
    fclose(file);
}

My theory is that fread() is reading past the end of the file and copying garbage. I am expecting fread() to read n bytes from the file, but perhaps that argument refers to n bytes of output instead? These values would differ, since it reads 2 bytes for each newline and then outputs 1, but I can't find any information on this. Nor would I know how to handle that without breaking my read operation into a bunch of really tiny "getline()" calls. But maybe that's just necessary? Any help is appreciated.


Solution

  • You should always check the return values of I/O functions. Checking for errors is reason enough, but when fread might store fewer bytes than it reads from the file (e.g., on Windows with files opened in the default text mode, where each CR LF pair is stored as a single newline), the return value is the only way to know how much was stored and thus how much of the buffer to use.

    The apparent repetition of data at the end of the buffer is evidence of an implementation strategy of reading into the buffer in binary mode and then shifting characters back to hide the carriage returns. This isn’t significant to a correct program, but it makes sense that the standard library would make use of the provided buffer this way.