Detecting EOF without 0-byte read in Rust


I've been working on some code that reads data from a Read type (the input) in chunks and does some processing on each chunk. The issue is that the final chunk needs to be processed with a different function. As far as I can tell, there are a couple of ways to detect EOF from a Read, but none of them feel particularly ergonomic for this case. I'm looking for a more idiomatic solution.

My current approach is to keep two buffers, so that the contents of the previous read are still available if the next read returns zero bytes, which indicates EOF in this case, since the buffer is of non-zero length:

use std::io::{Read, Result};

const BUF_SIZE: usize = 0x1000;

fn process_stream<I: Read>(mut input: I) -> Result<()> {
    // Stores a chunk of input to be processed
    let mut buf = [0; BUF_SIZE];
    let mut prev_buf = [0; BUF_SIZE];
    let mut prev_read = input.read(&mut prev_buf)?;

    loop {
        let bytes_read = input.read(&mut buf)?;
        if bytes_read == 0 {
            break;
        }

        // Some function which processes the contents of a chunk
        process_chunk(&prev_buf[..prev_read]);

        prev_read = bytes_read;
        prev_buf.copy_from_slice(&buf[..]);
    }

    // Some function used to process the final chunk differently from all other chunks
    process_final_chunk(&prev_buf[..prev_read]);
    Ok(())
}

This strikes me as a very ugly way to do this; I shouldn't need to use two buffers here.

An alternative I can think of would be to impose Seek on input and use input.read_exact(). I could then check for an ErrorKind::UnexpectedEof error to determine that we've hit the end of input, and seek backwards to read the final chunk again (the seek and re-read is necessary here because the contents of the buffer are unspecified in the case of an UnexpectedEof error). But this doesn't seem idiomatic at all: encountering an error, seeking back, and reading again just to detect that we're at the end of a file is very strange.

My ideal solution would be something like this, using an imaginary input.feof() function that returns true if the last input.read() call reached EOF, like the feof function in the C standard library:

fn process_stream<I: Read>(mut input: I) -> Result<()> {
    // Stores a chunk of input to be processed
    let mut buf = [0; BUF_SIZE];
    let mut bytes_read = 0;

    loop {
        bytes_read = input.read(&mut buf)?;

        if input.feof() {
            break;
        }

        process_chunk(&buf[..bytes_read]);
    }

    process_final_chunk(&buf[..bytes_read]);
    Ok(())
}

Can anyone suggest a way to implement this that is more idiomatic? Thanks!


Solution

  • When read of std::io::Read returns Ok(n), that not only means that the buffer buf has been filled in with n bytes of data from this source, it also indicates that the bytes from index n onward are left untouched. With this in mind, you actually don't need a prev_buf at all: when n is 0, every byte of the buffer is left untouched, so it still holds the bytes of the previous read.

    prog-fh's solution is what you want to go with for the kind of processing you want to do, because it will only hand off full chunks to process_chunk. This is needed because read can return any value between 0 and BUF_SIZE. For more info, see this part of the Read::read documentation:

    It is not an error if the returned value n is smaller than the buffer size, even when the reader is not at the end of the stream yet. This may happen for example because fewer bytes are actually available right now (e. g. being close to end-of-file) or because read() was interrupted by a signal.
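The short-read behaviour quoted above is why a single read call cannot be relied on to deliver a full chunk. Below is a minimal sketch of one way to fill a buffer completely. Note that read_full is a hypothetical helper name (not a standard library function, and not necessarily what prog-fh's answer does), and the sketch assumes that Ok(0) means no more data is coming.

```rust
use std::io::{self, ErrorKind, Read};

/// Hypothetical helper: keeps calling `read` until `buf` is full or a read
/// returns Ok(0). Returns how many bytes were written to the front of `buf`.
fn read_full<R: Read>(input: &mut R, buf: &mut [u8]) -> io::Result<usize> {
    let mut filled = 0;
    while filled < buf.len() {
        match input.read(&mut buf[filled..]) {
            // Assumption: an empty read means the input has ended.
            Ok(0) => break,
            // Short reads are fine; keep going from where we stopped.
            Ok(n) => filled += n,
            // A read interrupted by a signal can simply be retried.
            Err(e) if e.kind() == ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(filled)
}
```

With this helper, a return value smaller than buf.len() means a read returned Ok(0) before the buffer was full, so only the final chunk can be short.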

    However, I advise that you think about what should happen when you get an Ok(0) from read that does not represent end of file forever. See this part:

    If n is 0, then it can indicate one of two scenarios:

    1. This reader has reached its “end of file” and will likely no longer be able to produce bytes. Note that this does not mean that the reader will always no longer be able to produce bytes.

    So if you were to get a sequence of reads that returned Ok(BUF_SIZE), Ok(BUF_SIZE), Ok(0), Ok(BUF_SIZE) (which is entirely possible; it just represents a hitch in the IO), would you want the last Ok(BUF_SIZE) to be ignored rather than treated as a chunk? If you treat Ok(0) as EOF forever, that may be a mistake here.
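If treating the first Ok(0) as the end of input is acceptable for your source (a regular file, say), something close to the imaginary feof() can be had by peeking with BufRead::fill_buf, which returns an empty slice at EOF without consuming anything. The sketch below is one possible shape, not a definitive implementation. The on_chunk callback with its is_last flag is a hypothetical stand-in for process_chunk and process_final_chunk, chunk_size stands in for BUF_SIZE, and the function requires I: BufRead instead of plain Read.

```rust
use std::io::{self, BufRead, ErrorKind};

/// Sketch: reads `chunk_size`-byte chunks into a single buffer and tells the
/// (hypothetical) callback whether the chunk it receives is the last one.
fn process_stream<I: BufRead>(
    mut input: I,
    chunk_size: usize,
    mut on_chunk: impl FnMut(&[u8], bool),
) -> io::Result<()> {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    let mut buf = vec![0u8; chunk_size];

    loop {
        // Fill the buffer, tolerating short reads.
        let mut filled = 0;
        while filled < buf.len() {
            match input.read(&mut buf[filled..]) {
                Ok(0) => break, // assumption: Ok(0) is treated as EOF
                Ok(n) => filled += n,
                Err(e) if e.kind() == ErrorKind::Interrupted => continue,
                Err(e) => return Err(e),
            }
        }

        // A partially filled buffer means EOF was hit while filling.
        // Otherwise peek: an empty slice from fill_buf signals EOF, and
        // nothing is consumed, so the next chunk is not lost.
        // (An Interrupted error from fill_buf is propagated in this sketch.)
        let at_eof = filled < buf.len() || input.fill_buf()?.is_empty();

        on_chunk(&buf[..filled], at_eof);
        if at_eof {
            return Ok(());
        }
    }
}
```

For a plain Read such as a File, wrapping it in std::io::BufReader::new provides the BufRead implementation. An empty input yields a single callback invocation with an empty slice and is_last set to true.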

    The only way to reliably determine which chunk is the last one is to have the expected length (in bytes, not the number of chunks) sent beforehand as part of the protocol. Given a variable expected_len, the last chunk then starts at index expected_len - expected_len % BUF_SIZE and ends at expected_len itself.
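The index arithmetic above can be sketched as a small function. One caveat, stated here as an assumption about the intended behaviour: when expected_len is an exact multiple of BUF_SIZE, expected_len - expected_len % BUF_SIZE equals expected_len, which would make the last chunk empty, so this sketch treats the final full chunk as the last one in that case. The name last_chunk_range is hypothetical.

```rust
use std::ops::Range;

/// Byte range of the final chunk of a stream that is `expected_len` bytes
/// long and is processed in chunks of `chunk_size` bytes.
fn last_chunk_range(expected_len: usize, chunk_size: usize) -> Range<usize> {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    let rem = expected_len % chunk_size;
    let start = if rem == 0 && expected_len > 0 {
        // Assumption: for an exact multiple, the last full chunk is the
        // final one (the plain formula would give an empty range).
        expected_len - chunk_size
    } else {
        // The formula from the text: round down to a chunk boundary.
        expected_len - rem
    };
    start..expected_len
}
```

For example, with a chunk size of 4, a 10-byte stream gives 8..10 and an 8-byte stream gives 4..8.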