c++ character-encoding text-parsing byte-order-mark c++23

Encoding-agnostic parsing with C++2b

Sometimes I have to parse text files in various encodings. I wonder whether the upcoming standard will bring some tools for this, because I'm not very happy with my current solution. I'm not even sure this is the right approach, but I define a functor template that extracts a character from a stream:

#include <cstddef> // 'std::size_t'
#include <string>  // 'std::char_traits'
#include <istream> // 'std::istream'

/////////////////////////////////////////////////////////////////////////////
// Generic implementation (couldn't resist to put one)
template<bool LE,typename T> class ReadChar
{
 public:
    std::istream& operator()(T& c, std::istream& in)
       {
        // 'read' expects a 'char*', so the unsigned buffer needs a cast
        in.read(reinterpret_cast<char*>(buf),bufsiz);
        //const std::streamsize n_read = in ? bufsiz : in.gcount();
        if(!in)
           {// Could not read all bytes
            c = static_cast<T>(std::char_traits<T>::eof());
           }
        else if constexpr (LE)
           {// Little endian: buf[0] is the least significant byte
            c = 0;
            for(std::size_t i=0; i<bufsiz; ++i) c |= static_cast<T>(buf[i]) << (8*i);
           }
        else
           {// Big endian: buf[0] is the most significant byte
            c = 0; // Must start from zero before OR-ing the bytes in
            for(std::size_t i=0; i<bufsiz; ++i) c |= static_cast<T>(buf[i]) << (8*(bufsiz-1-i));
           }
        return in;
       }

 private:
    static constexpr std::size_t bufsiz = sizeof(T);
    unsigned char buf[bufsiz];
};

/////////////////////////////////////////////////////////////////////////////
// Partial specialization for 32bit chars
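template<bool LE> class ReadChar<LE,char32_t>
{
 public:
    std::istream& operator()(char32_t& c, std::istream& in)
       {
        in.read(reinterpret_cast<char*>(buf),4);
        // Widen each byte before shifting to avoid sign extension and overflow
        const char32_t b0=buf[0], b1=buf[1], b2=buf[2], b3=buf[3];
        if constexpr (LE) c = b0 | (b1 << 8) | (b2 << 16) | (b3 << 24); // Little endian
        else              c = (b0 << 24) | (b1 << 16) | (b2 << 8) | b3; // Big endian
        return in;
       }

 private:
    unsigned char buf[4] = {}; // Unsigned: a plain 'char' may be signed
};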

/////////////////////////////////////////////////////////////////////////////
// Partial specialization for 16bit chars
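template<bool LE> class ReadChar<LE,char16_t>
{
 public:
    std::istream& operator()(char16_t& c, std::istream& in)
       {
        in.read(reinterpret_cast<char*>(buf),2);
        // Widen each byte before shifting to avoid sign extension
        const unsigned int b0=buf[0], b1=buf[1];
        if constexpr (LE) c = static_cast<char16_t>(b0 | (b1 << 8)); // Little endian
        else              c = static_cast<char16_t>((b0 << 8) | b1); // Big endian
        return in;
       }

 private:
    unsigned char buf[2] = {}; // Unsigned: a plain 'char' may be signed
};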

/////////////////////////////////////////////////////////////////////////////
// Specialization for 8bit chars
template<> class ReadChar<false,char>
{
 public:
    std::istream& operator()(char& c, std::istream& in)
       {
        return in.get(c);
       }
};

I use ReadChar to implement the parsing function:

template<typename T,bool LE=false> void parse(std::istream& fin) // 'LE' is defaulted so that 'parse<char>' compiles
{
    ReadChar<LE,T> get;
    T c;
    while( get(c,fin) )
       {
        if(c==static_cast<T>('a')) {/* ... */} // Ugly comparison of T with a char literal
       }
}

The ugly part is the static_cast needed whenever I compare against a char literal.

Then I use parse with this ugly boilerplate code:

#include <fstream>   // 'std::ifstream'
#include <stdexcept> // 'std::runtime_error'
std::ifstream fin("/path/to/file", std::ios::binary);
auto bom = check_bom(fin); // 'check_bom' function is quite trivial
     if( bom.is_empty()  )  parse<char>(fin);
else if( bom.is_utf8() )    parse<char>(fin); // In my case there's no need to handle multi-byte chars
else if( bom.is_utf16le() ) parse<char16_t,true>(fin);
else if( bom.is_utf16be() ) parse<char16_t,false>(fin);
else if( bom.is_utf32le() ) parse<char32_t,true>(fin);
else if( bom.is_utf32be() ) parse<char32_t,false>(fin);
else                        throw std::runtime_error("Unrecognized BOM");

Now, this solution has some quirks (I can't use string literals directly in parse). My question is whether there are alternative approaches to this problem, perhaps using existing or upcoming standard facilities that I'm unaware of.

Solution

In C++17 we gained type-safe unions (std::variant). Together with std::visit, they can be used to map runtime state to compile-time state.

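#include <cstddef>     // std::size_t
#include <type_traits> // std::integral_constant, std::decay_t
#include <utility>     // std::index_sequence
#include <variant>     // std::variant, std::visit

template<auto x>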
using constant_t = std::integral_constant<std::decay_t<decltype(x)>, x>;
template<auto x>
constexpr constant_t<x> constant = {};

template<auto...Xs>
using variant_enum_t = std::variant< constant_t<Xs>... >;

enum class EBom {
  None,
  utf8,
  utf16le,
  utf16be,
  utf32le,
  utf32be,
  count,
};
// you could use the existence of EBom::count and the
// assumption of contiguous indexes to automate this as well:
using VEBom = variant_enum_t< EBom::None, EBom::utf8, EBom::utf16le, EBom::utf16be, EBom::utf32le, EBom::utf32be >;

template<std::size_t...Is>
constexpr VEBom make_ve_bom( EBom bom, std::index_sequence<Is...> ) {
  // no 'static' here: static locals in constexpr functions are only allowed since C++23
  constexpr VEBom retvals[] = {
    constant<static_cast<EBom>(Is)>...
  };
  return retvals[ static_cast<std::size_t>(bom) ];
}
constexpr VEBom make_ve_bom( EBom bom ) {
  return make_ve_bom( bom, std::make_index_sequence< static_cast<std::size_t>(EBom::count) >{} );
}

And now, with a runtime EBom value, we can produce a VEBom.
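
As a quick check of that claim, the sketch below restates the pieces in compact form and verifies two properties: the index of the active alternative equals the enum's underlying value, and visiting the variant hands the value back as a compile-time constant. The helper round_trip is a hypothetical name added only for this illustration.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>
#include <variant>

template<auto x>
using constant_t = std::integral_constant<std::decay_t<decltype(x)>, x>;

enum class EBom { None, utf8, utf16le, utf16be, utf32le, utf32be, count };

using VEBom = std::variant<
  constant_t<EBom::None>,    constant_t<EBom::utf8>,
  constant_t<EBom::utf16le>, constant_t<EBom::utf16be>,
  constant_t<EBom::utf32le>, constant_t<EBom::utf32be>>;

template<std::size_t... Is>
constexpr VEBom make_ve_bom(EBom bom, std::index_sequence<Is...>) {
  // One pre-built variant per enumerator; entry i holds alternative i.
  constexpr VEBom table[] = { constant_t<static_cast<EBom>(Is)>{}... };
  return table[static_cast<std::size_t>(bom)];
}
constexpr VEBom make_ve_bom(EBom bom) {
  return make_ve_bom(bom,
    std::make_index_sequence<static_cast<std::size_t>(EBom::count)>{});
}

// The active alternative's index equals the enum's underlying value.
static_assert(make_ve_bom(EBom::utf16be).index() == 3);

// Hypothetical helper: visiting recovers the value from the tag's type.
inline EBom round_trip(EBom bom) {
  return std::visit([](auto tag) { return decltype(tag)::value; },
                    make_ve_bom(bom));
}
```

The lookup assumes the enumerators are contiguous and start at zero, and that the argument is a valid enumerator other than count.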

With that VEBom we can get at the type at compile time. Suppose you have traits, like:

template<EBom>
constexpr bool bom_is_bigendian_v = ???;
template<EBom>
using bom_chartype_t = ???;
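
One possible way to fill in such traits is sketched below. The definitions are assumptions made for illustration: the mapping (no BOM and UTF-8 to char, UTF-16 to char16_t, UTF-32 to char32_t) simply mirrors the dispatch code in the question.

```cpp
#include <type_traits>

enum class EBom { None, utf8, utf16le, utf16be, utf32le, utf32be, count };

// Assumed trait: true only for the big-endian encodings.
template<EBom B>
constexpr bool bom_is_bigendian_v =
  (B == EBom::utf16be || B == EBom::utf32be);

// Assumed trait: the code unit type used to read each encoding.
template<EBom B>
using bom_chartype_t =
  std::conditional_t<(B == EBom::utf16le || B == EBom::utf16be), char16_t,
  std::conditional_t<(B == EBom::utf32le || B == EBom::utf32be), char32_t,
  char>>;

static_assert(std::is_same_v<bom_chartype_t<EBom::utf8>, char>);
static_assert(std::is_same_v<bom_chartype_t<EBom::utf16be>, char16_t>);
static_assert(bom_is_bigendian_v<EBom::utf32be>);
static_assert(!bom_is_bigendian_v<EBom::utf16le>);
```

Endianness is meaningless for the single-byte cases, so the trait just reports false for them.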

You can now write code like:

// std::visit takes the visitor first, then the variant
std::visit( [&](auto bom) {
  bom_chartype_t<bom> next = ???;
  if constexpr (bom_is_bigendian_v<bom>) {
    // swizzle
  }

}, vebom );

etc.

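Your non-DRY code becomes DRY with a simple rewrite:

template<bool LE, class char_t> class ReadChar {
public:
  std::istream& operator()(char_t& c, std::istream& in)
  {
    in.read(reinterpret_cast<char*>(buf),sizeof(char_t));
    // assemble the bytes as little endian, whatever the size of char_t
    c = 0;
    for (std::size_t i=0; i<sizeof(char_t); ++i)
      c |= static_cast<char_t>(buf[i]) << (8*i);
    if constexpr(!LE)
      reverse_bytes(&c); // helper not shown here: swaps the byte order of c
    return in;
  }
private:
  unsigned char buf[sizeof(char_t)] = {}; // unsigned avoids sign extension
};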
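
The reverse_bytes helper is not defined in this answer. A minimal sketch of what it might look like, assuming it takes a pointer to a trivially copyable value and reverses its object representation in place:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical helper matching the call reverse_bytes(&c) above.
template<class T>
void reverse_bytes(T* value) {
  static_assert(std::is_trivially_copyable_v<T>,
                "byte-wise copying requires a trivially copyable type");
  unsigned char bytes[sizeof(T)];
  std::memcpy(bytes, value, sizeof(T));    // copy out the object representation
  std::reverse(bytes, bytes + sizeof(T));  // flip the byte order
  std::memcpy(value, bytes, sizeof(T));    // copy it back
}
```

Going through memcpy avoids aliasing problems, and for a one-byte type the call is a no-op.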

Your boilerplate becomes:

std::ifstream fin("/path/to/file", std::ios::binary);
auto bom = check_bom(fin); // 'check_bom' function is quite trivial
if (bom.invalid())
  throw std::runtime_error("Unrecognized BOM");

auto vebom = make_ve_bom( bom.getEnum() );
std::visit( [&]( auto ebom ) {
  parse<bom_chartype_t<ebom>, !bom_is_bigendian_v<ebom>>( fin );
}, vebom );

and the magic is done elsewhere.

The magic here is that the std::variant holds a bunch of integral_constants, each of which is both stateless and knows (in its type) what its value is.

So the only state in the std::variant is which of the stateless enum values it contains.

std::visit proceeds to call the passed in lambda with whichever stateless std::integral_constant that is in the std::variant. Within that lambda, we can use its value as a compile time constant, like we would with any other std::integral_constant.

The runtime state of the std::variant is actually the value of the EBom because of how we set it up, so converting an EBom to a VEBom is literally copying the number over (so, free). The magic is in std::visit, which automates writing the switch statement and injecting the compile time (integral constant) value for each of the possibilities into your code.
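
To make that concrete, here is roughly the switch you would otherwise write by hand. This is only a sketch of the idea, not what std::visit does internally; dispatch_bom, bom_constant and code_unit_size are hypothetical names introduced for the illustration.

```cpp
#include <cstddef>
#include <stdexcept>
#include <type_traits>

enum class EBom { None, utf8, utf16le, utf16be, utf32le, utf32be, count };

template<EBom B>
using bom_constant = std::integral_constant<EBom, B>;

// Hand-written dispatch: each case passes a differently typed, stateless tag.
template<class F>
decltype(auto) dispatch_bom(EBom bom, F&& f) {
  switch (bom) {
    case EBom::None:    return f(bom_constant<EBom::None>{});
    case EBom::utf8:    return f(bom_constant<EBom::utf8>{});
    case EBom::utf16le: return f(bom_constant<EBom::utf16le>{});
    case EBom::utf16be: return f(bom_constant<EBom::utf16be>{});
    case EBom::utf32le: return f(bom_constant<EBom::utf32le>{});
    case EBom::utf32be: return f(bom_constant<EBom::utf32be>{});
    default: throw std::invalid_argument("not a valid EBom");
  }
}

// Inside the generic lambda the enum value is a compile-time constant.
inline std::size_t code_unit_size(EBom bom) {
  return dispatch_bom(bom, [](auto tag) -> std::size_t {
    constexpr EBom b = decltype(tag)::value;
    if constexpr (b == EBom::utf16le || b == EBom::utf16be) return 2;
    else if constexpr (b == EBom::utf32le || b == EBom::utf32be) return 4;
    else return 1;
  });
}
```

With the variant of constants, std::visit generates the equivalent of dispatch_bom for you, so adding an enumerator does not mean touching a switch.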

None of this is C++23. Most of it is C++17; I may have used a C++20 feature in there as well.

The above code is not compiled, it is just written. It probably contains typos, but the technique is sound.

We can automate the making of the variant type:

template<class Enum, std::size_t...Is, class VEnum=variant_enum_t<
  static_cast<Enum>(Is)...
>>
constexpr VEnum make_venum( Enum e, std::index_sequence<Is...> ) {
  // no 'static': static locals in constexpr functions need C++23
  constexpr VEnum retvals[] = {
    constant<static_cast<Enum>(Is)>...
  };
  return retvals[ static_cast<std::size_t>(e) ];
}
template<class Enum>
constexpr auto make_venum( Enum e ) {
  return make_venum( e, std::make_index_sequence< static_cast<std::size_t>(Enum::count) >{} );
}
template<class Enum>
using venum_t = decltype(make_venum( static_cast<Enum>(0) ));

Now our VEBom is just:

using VEBom = venum_t<EBom>;

Anyhow, a live example with typos fixed.