Implement reading from stream via copy


I have a class which represents a character sequence and I’d like to implement an operator >> for it. My implementation currently looks like this:

inline std::istream& operator >>(std::istream& in, seq& rhs) {
    std::copy(
        std::istream_iterator<char>(in),
        std::istream_iterator<char>(),
        std::back_inserter(rhs));
    // `copy` doesn't know when to stop reading so it always also sets `fail`
    // along with `eof`, even if reading succeeded. On the other hand, when
    // reading actually failed, `eof` is not going to be set.
    if (in.fail() and in.eof())
        in.clear(std::ios_base::eofbit);
    return in;
}

However, the following predictably fails:

std::istringstream istr("GATTACA FOO");
seq s;
assert((istr >> s) and s == "GATTACA");

In particular, once we reach the space in “GATTACA FOO”, the copying stops (expected) and sets the failbit on the istream (also expected). However, the read operation actually succeeded as far as seq is concerned.

Can I model this at all using std::copy? I also thought of using an istreambuf_iterator instead but this doesn’t actually solve this particular problem.

What’s more, a read operation on the input “GATTACAFOO” should fail since that input doesn’t represent a valid DNA sequence (which is what my class represents). On the other hand, reading an int from the input 42foo actually succeeds in C++, so maybe I should consider every valid prefix as a valid input?
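
The int behaviour mentioned above can be checked with a small sketch. The names int_read_result and read_int_prefix are made up for this illustration; the point is only that extraction stops at the first character that cannot belong to the number and leaves the rest in the stream.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Result of one extraction attempt (hypothetical helper type).
struct int_read_result {
    int value;          // the extracted number, 0 if extraction failed
    bool ok;            // true if the extraction itself succeeded
    std::string rest;   // unread remainder of the input
};

// Extracts an int from `text` and reports what is left in the stream.
inline int_read_result read_int_prefix(const std::string& text) {
    std::istringstream in(text);
    int_read_result r{0, false, ""};
    r.ok = static_cast<bool>(in >> r.value);
    in.clear();                 // allow reading the remainder even after a failure
    std::getline(in, r.rest);   // everything up to the end of the line
    return r;
}

inline void example_int_prefix() {
    int_read_result r = read_int_prefix("42foo");
    assert(r.ok && r.value == 42);  // the valid prefix is accepted
    assert(r.rest == "foo");        // the rest stays in the stream
}
```

So for int, a valid prefix counts as a successful read, and the trailing text only becomes a problem for whatever is read next.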

(Incidentally, this would be fairly straightforward with an explicit loop but I’m trying to avoid explicit loops in favour of algorithms.)


Solution

  • You don't want to call clear(eofbit), because failbit should stay set if reading failed due to reaching EOF. If you leave eofbit set without failbit, a loop such as while (in >> s) will attempt another read after reaching EOF, and that read will set failbit again. Except that with your operator>> the failbit would be cleared again, and the loop would try to read again. And again. And again. The right behaviour for a stream is to set failbit if reading failed because of EOF, so just leave it set.
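
    To see the state a stream is normally left in, here is a minimal sketch that uses std::string extraction, which follows the convention described above. The function names read_words and example_loop_state are made up for this illustration.

    ```cpp
    #include <cassert>
    #include <sstream>
    #include <string>
    #include <vector>

    // Reads whitespace-separated words until an extraction fails, the way a
    // typical `while (in >> s)` loop does.
    inline std::vector<std::string> read_words(std::istream& in) {
        std::vector<std::string> words;
        std::string word;
        while (in >> word)          // stops once an extraction sets failbit
            words.push_back(word);
        return words;
    }

    inline void example_loop_state() {
        std::istringstream in("GATTACA FOO");
        std::vector<std::string> words = read_words(in);
        assert(words.size() == 2);
        // Reading "FOO" reached the end of the input, so eofbit was set, but
        // that read succeeded. The next read found nothing and set failbit
        // as well, which is what ended the loop.
        assert(in.eof() && in.fail());
    }
    ```

    If the last failed read had its failbit cleared, the loop condition would stay true and the loop would never terminate.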

    To do this with iterators and an algorithm you'd need something like

    copy_while(InputIter, InputIter, OutputIter, Pred);
    

    which would copy the input sequence only while the predicate was true, but that doesn't exist in the standard library. You could certainly write one though.

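    #include <iterator>  // std::iterator_traits

    // Copies elements from [begin, end) to result until pred returns false.
    // The first rejected element is not copied.
    template<typename InputIter, typename OutputIter, typename Pred>
      OutputIter
      copy_while(InputIter begin, InputIter end, OutputIter result, Pred pred)
      {
        while (begin != end)
        {
          typename std::iterator_traits<InputIter>::value_type value = *begin;
          if (!pred(value))
            break;
          *result = value;
          ++result;
          ++begin;
        }
        return result;
      }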
    

    Now you could use that like this:

    inline bool
    is_valid_seq_char(char c)
    { return std::string("ACGT").find(c) != std::string::npos; }
    
    inline std::istream&
    operator>>(std::istream& in, seq& rhs)
    {
        copy_while(
            std::istream_iterator<char>(in),
            std::istream_iterator<char>(),
            std::back_inserter(rhs),
            &is_valid_seq_char);
        return in;
    }
    
    int main()
    {
        std::istringstream istr("GATTACA FOO");
        seq s;
        assert((istr >> s) and s == "GATTACA");
    }
    

    This works, but the problem is that istream_iterator uses operator>> to read characters, so it skips over whitespace. This means the space following "GATTACA" is consumed by the algorithm and discarded, so adding this to the end of main would fail:

    assert(istr.get() == ' ');
    

    To solve this use istreambuf_iterator which doesn't skip whitespace:

    inline std::istream&
    operator>>(std::istream& in, seq& rhs)
    {
        copy_while(
            std::istreambuf_iterator<char>(in),
            std::istreambuf_iterator<char>(),
            std::back_inserter(rhs),
            &is_valid_seq_char);
        return in;
    }
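
    The difference between the two iterator types can be checked in isolation. This is a small sketch; read_formatted and read_unformatted are names invented here, not part of any library.

    ```cpp
    #include <cassert>
    #include <iterator>
    #include <sstream>
    #include <string>

    // Collects every character that istream_iterator<char> yields.
    // It reads with operator>>, so whitespace is skipped (skipws is on by default).
    inline std::string read_formatted(const std::string& text) {
        std::istringstream in(text);
        return std::string(std::istream_iterator<char>(in),
                           std::istream_iterator<char>());
    }

    // Collects every character that istreambuf_iterator<char> yields.
    // It reads straight from the stream buffer, so nothing is skipped.
    inline std::string read_unformatted(const std::string& text) {
        std::istringstream in(text);
        return std::string(std::istreambuf_iterator<char>(in),
                           std::istreambuf_iterator<char>());
    }

    inline void example_whitespace() {
        assert(read_formatted("GATTACA FOO") == "GATTACAFOO");
        assert(read_unformatted("GATTACA FOO") == "GATTACA FOO");
    }
    ```

    Another difference worth knowing: dereferencing an istreambuf_iterator only peeks at the next character, so the character that stops copy_while stays in the stream, whereas istream_iterator has already extracted it.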
    

    To complete this, you probably want to indicate failure to extract a seq if no characters were extracted:

    inline std::istream&
    operator>>(std::istream& in, seq& rhs)
    {
        copy_while( std::istreambuf_iterator<char>(in), {},
            std::back_inserter(rhs), &is_valid_seq_char);
        if (rhs.empty())
          in.setstate(std::ios::failbit);  // no seq in stream
        return in;
    }
    

    That final version also uses one of my favourite C++11 tricks to simplify it slightly, by using {} for the end iterator. The type of the second argument to copy_while must be the same as the type of the first argument, which is deduced as std::istreambuf_iterator<char>, so the {} simply value-initializes another iterator of that same type.
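
    The same {} trick works with any function template whose two iterator parameters share one type. As a sketch, here it is with the iterator-pair constructor of std::string; the helper name slurp is made up for this example.

    ```cpp
    #include <cassert>
    #include <iterator>
    #include <sstream>
    #include <string>

    // Reads the rest of a stream into a string. The constructor used here is a
    // template taking two iterators of one type; that type is deduced from the
    // first argument, and `{}` then value-initializes the matching end iterator.
    inline std::string slurp(std::istream& in) {
        return std::string(std::istreambuf_iterator<char>(in), {});
    }

    inline void example_slurp() {
        std::istringstream in("GATTACA FOO\n");
        assert(slurp(in) == "GATTACA FOO\n");   // whitespace is preserved
    }
    ```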

    Edit: If you want a closer match to std::string extraction, you can do that too:

    inline std::istream&
    operator>>(std::istream& in, seq& rhs)
    {
        std::istream::sentry s(in);
        if (s)
        {
            copy_while( std::istreambuf_iterator<char>(in), {},
                        std::back_inserter(rhs), &is_valid_seq_char);
            int eof = std::char_traits<char>::eof();
            if (std::char_traits<char>::eq_int_type(in.rdbuf()->sgetc(), eof))
                in.setstate(std::ios::eofbit);
        }
        if (rhs.empty())
            in.setstate(std::ios::failbit);
        return in;
    }
    

    The sentry will skip leading whitespace, and if extraction reaches the end of the input, eofbit gets set. The other change that should probably be made is to empty the seq before pushing anything into it, e.g. start with rhs.clear() or the equivalent for your seq type.
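
    Putting the pieces together, a self-contained sketch could look like the following. The question does not show the seq class, so dna_seq here is a hypothetical stand-in that assumes push_back, empty and clear; adapt it to the real interface.

    ```cpp
    #include <cassert>
    #include <istream>
    #include <iterator>
    #include <sstream>
    #include <string>

    // Hypothetical stand-in for the question's seq class.
    struct dna_seq {
        using value_type = char;               // required by std::back_inserter
        std::string bases;
        void push_back(char c) { bases.push_back(c); }
        bool empty() const { return bases.empty(); }
        void clear() { bases.clear(); }
    };

    template<typename InputIter, typename OutputIter, typename Pred>
    OutputIter copy_while(InputIter begin, InputIter end, OutputIter result, Pred pred) {
        while (begin != end) {
            typename std::iterator_traits<InputIter>::value_type value = *begin;
            if (!pred(value))
                break;
            *result = value;
            ++result;
            ++begin;
        }
        return result;
    }

    inline bool is_valid_seq_char(char c) {
        return std::string("ACGT").find(c) != std::string::npos;
    }

    inline std::istream& operator>>(std::istream& in, dna_seq& rhs) {
        rhs.clear();                            // discard any previous contents
        std::istream::sentry s(in);             // skips leading whitespace
        if (s) {
            copy_while(std::istreambuf_iterator<char>(in), {},
                       std::back_inserter(rhs), &is_valid_seq_char);
            typedef std::char_traits<char> traits;
            if (traits::eq_int_type(in.rdbuf()->sgetc(), traits::eof()))
                in.setstate(std::ios::eofbit);  // input exhausted
        }
        if (rhs.empty())
            in.setstate(std::ios::failbit);     // no sequence was extracted
        return in;
    }

    inline void example_extraction() {
        std::istringstream in("  GATTACA FOO");
        dna_seq s;
        assert((in >> s) && s.bases == "GATTACA");
        assert(in.get() == ' ');                // the delimiter is still in the stream
        assert(!(in >> s) && s.empty());        // "FOO" is not a sequence: failbit
    }
    ```

    With this version, reading "ACGT" up to the end of the input sets eofbit but not failbit, and only the following read fails, so a while (in >> s) loop terminates as expected.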