c, text-files, fopen

Open a non-text file without Windows line endings


I took over a project that uses the following function to read files:

char *fetchFile(char *filename) {
    char *buffer;
    int len;
    FILE *f = fopen(filename, "rb");
    if(f) {
        if(verbose) {
            fprintf(stdout, "Opened file %s successfully\n", filename);
        }
        fseek(f, 0, SEEK_END);
        len = ftell(f);
        fseek(f, 0, SEEK_SET);
        if(verbose) {
            fprintf(stdout, "Allocating memory for buffer for %s\n", filename);
        }
        buffer = malloc(len + 1);
        if(buffer) {
            fread(buffer, 1, len, f);
            /* terminate only when the allocation succeeded */
            buffer[len] = '\0';
        }
        fclose(f);
    } else {
        fprintf(stderr, "Error reading file %s\n", filename);
        exit(1);
    }
    return buffer;
}

The rb mode is used because sometimes the file can be a spreadsheet and therefore I want the information as in a text file.

The program runs on a Linux machine, but the files to read come from both Linux and Windows.

I am not sure which approach is best for keeping Windows line endings from interfering with my code.

I was thinking of using dos2unix at the start of this function. I also thought of opening in r mode, but I believe that could potentially mess things up when opening non-text files.

I would like to understand better the differences between using:

  1. dos2unix,
  2. r vs rb mode,
  3. or any other solution which would fit better the problem.

Note: I believe that I understand r vs rb modes, but I would appreciate an explanation of why using r is a good or bad solution for this specific situation (I think it wouldn't be good because sometimes it opens spreadsheets, but I am not sure of that).


Solution

  • If my understanding is correct the rb mode is used because sometimes the file can be a spreadsheet and therefore the programs just want the information as in a text file.

    You seem uncertain, and though perhaps you do understand correctly, your explanation does not give me any confidence in that.

    C knows about two distinct kinds of streams: binary streams and text streams. A binary stream is simply an ordered sequence of bytes, written and / or read as-is without any kind of transformation. On the other hand,

    A text stream is an ordered sequence of characters composed into lines, each line consisting of zero or more characters plus a terminating new-line character. Whether the last line requires a terminating new-line character is implementation-defined. Characters may have to be added, altered, or deleted on input and output to conform to differing conventions for representing text in the host environment. Thus, there need not be a one-to-one correspondence between the characters in a stream and those in the external representation. [...]

    (C2011 7.21.2/2)

    For some implementations, such as POSIX-compliant ones, this is a distinction without a difference. For other implementations, such as those targeting Windows, the difference matters. In particular, on Windows, text streams convert on the fly between carriage-return / line-feed pairs in the external representation and newlines (only) in the internal representation.

    The b in your fopen() mode specifies that the file should be opened as a binary stream -- that is, no translation will be performed on the bytes read from the file. Whether this is the right thing to do depends on your environment and the application's requirements. This is moot on Linux or another Unix, however, as there is no observable difference between text and binary streams on such systems.
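
    To see what "no translation" means in practice, a small experiment along the following lines may help. It is only a sketch: the helper names write_crlf_sample and count_chars are made up for illustration and are not part of the original program.

```c
#include <assert.h>
#include <stdio.h>

/* Write a 6-byte sample with Windows line endings: "a\r\nb\r\n".
 * Hypothetical helper for this demonstration only. */
int write_crlf_sample(const char *path) {
    FILE *f = fopen(path, "wb");   /* binary mode: bytes are written as-is */
    if (!f) return -1;
    size_t written = fwrite("a\r\nb\r\n", 1, 6, f);
    return (fclose(f) == 0 && written == 6) ? 0 : -1;
}

/* Count how many characters fgetc() delivers when the file is opened
 * with the given mode ("r" or "rb"). Returns -1 if the file cannot be opened. */
long count_chars(const char *path, const char *mode) {
    assert(path != NULL && mode != NULL);
    FILE *f = fopen(path, mode);
    if (!f) return -1;
    long n = 0;
    while (fgetc(f) != EOF) {
        n++;
    }
    fclose(f);
    return n;
}
```

    On Linux, count_chars(path, "r") and count_chars(path, "rb") should both report 6 for this sample, carriage returns included. A Windows C runtime would be expected to report 4 in text mode, because each carriage-return / line-feed pair arrives as a single newline.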

    dos2unix converts carriage-return / line-feed pairs in the input file to single line-feed (newline) characters. This will convert a Windows-style text file, or one with mixed Windows / Unix line terminators, to the Unix text-file convention. It is irreversible if there are both Windows-style and Unix-style line terminators in the file, and it is furthermore likely to corrupt your file if it is not a text file in the first place.

    If your inputs are sometimes binary files then opening in binary mode is appropriate, and conversion via dos2unix probably is not. If that's the case and you also need translation for text-file line terminators, then you first and foremost need a way to distinguish which case applies for any particular file -- for example, by command-line argument or by pre-analyzing the file via libmagic. You then must provide different handling for text files; your main options are

    1. Perform the line terminator conversion in your own code.
    2. Provide separate versions of the fetchFile() function for text and binary files.
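
    As a rough sketch of option 1, the conversion can be done in place on the buffer after it has been read in binary mode. Both function names below are hypothetical. The looks_binary check is a deliberately crude stand-in (a NUL byte means "binary") for a real classifier such as libmagic or a command-line flag, and it will misjudge some files.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Crude heuristic, an assumption for illustration only: data that
 * contains a NUL byte is treated as binary. Returns 1 or 0. */
int looks_binary(const char *buf, size_t len) {
    assert(buf != NULL || len == 0);
    return len > 0 && memchr(buf, '\0', len) != NULL;
}

/* Collapse every "\r\n" pair to "\n" in place. A lone '\r' that is not
 * followed by '\n' is kept. Returns the new length of the data. */
size_t crlf_to_lf(char *buf, size_t len) {
    assert(buf != NULL || len == 0);
    size_t out = 0;
    for (size_t in = 0; in < len; in++) {
        if (buf[in] == '\r' && in + 1 < len && buf[in + 1] == '\n') {
            continue;   /* drop the CR; the LF is copied on the next pass */
        }
        buf[out++] = buf[in];
    }
    return out;
}
```

    Inside fetchFile(), after the fread() call, this could be used roughly as: if looks_binary(buffer, len) is false, set len = crlf_to_lf(buffer, len) and then write the terminating '\0' at buffer[len]. Because the converted data is never longer than the original, no reallocation is needed. Binary files skip the conversion and stay byte-for-byte intact.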