Sometimes I have to parse text files with various encodings. I wonder whether the upcoming standard will bring some tools for this, because I'm not very happy with my current solution. I'm not even sure this is the right approach, but I define a functor template to extract a character from a stream:
#include <string>
#include <cstddef> // 'std::size_t'
#include <istream> // 'std::istream'
/////////////////////////////////////////////////////////////////////////////
// Generic implementation (couldn't resist adding one)
template<bool LE,typename T> class ReadChar
{
public:
std::istream& operator()(T& c, std::istream& in)
{
in.read(reinterpret_cast<char*>(buf),bufsiz); // 'read' takes a 'char*'
//const std::streamsize n_read = in ? bufsiz : in.gcount();
if(!in)
{// Could not read all bytes
c = std::char_traits<T>::eof();
}
else if constexpr (LE)
{// Little endian
c = buf[0];
for(std::size_t i=1; i<bufsiz; ++i) c |= static_cast<T>(buf[i]) << (8*i);
}
else
{// Big endian
const std::size_t imax = bufsiz-1;
c = buf[imax]; // Initialize 'c' before OR-ing the higher bytes in
for(std::size_t i=0; i<imax; ++i) c |= static_cast<T>(buf[i]) << (8*(imax-i));
}
return in;
}
private:
static constexpr std::size_t bufsiz = sizeof(T);
unsigned char buf[bufsiz];
};
/////////////////////////////////////////////////////////////////////////////
// Partial specialization for 32bit chars
template<bool LE> class ReadChar<LE,char32_t>
{
public:
std::istream& operator()(char32_t& c, std::istream& in)
{
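in.read(reinterpret_cast<char*>(buf),4);
if constexpr (LE) c = buf[0] | (buf[1] << 8) | (buf[2] << 16) | (static_cast<char32_t>(buf[3]) << 24); // Little endian
else c = (static_cast<char32_t>(buf[0]) << 24) | (buf[1] << 16) | (buf[2] << 8) | buf[3]; // Big endian
return in;
}
private:
unsigned char buf[4]; // unsigned, so bytes >= 0x80 are not sign-extended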
};
/////////////////////////////////////////////////////////////////////////////
// Partial specialization for 16bit chars
template<bool LE> class ReadChar<LE,char16_t>
{
public:
std::istream& operator()(char16_t& c, std::istream& in)
{
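in.read(reinterpret_cast<char*>(buf),2);
if constexpr (LE) c = static_cast<char16_t>(buf[0] | (buf[1] << 8)); // Little endian
else c = static_cast<char16_t>((buf[0] << 8) | buf[1]); // Big endian
return in;
}
private:
unsigned char buf[2]; // unsigned, so bytes >= 0x80 are not sign-extended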
};
/////////////////////////////////////////////////////////////////////////////
// Specialization for 8bit chars
template<> class ReadChar<false,char>
{
public:
std::istream& operator()(char& c, std::istream& in)
{
return in.get(c);
}
};
I use ReadChar to implement the parsing function:
template<typename T,bool LE=false> void parse(std::istream& fin) // default LE so that 'parse<char>(fin)' compiles
{
ReadChar<LE,T> get;
T c;
while( get(c,fin) )
{
if(c==static_cast<T>('a')) {/* ... */} // Ugly comparison of T with a char literal
}
}
The ugly part is the static_cast needed whenever I compare against a char literal.
I then call parse with this ugly boilerplate code:
#include <fstream> // 'std::ifstream'
#include <stdexcept> // 'std::runtime_error'
std::ifstream fin("/path/to/file", std::ios::binary);
auto bom = check_bom(fin); // 'check_bom' function is quite trivial
if( bom.is_empty() ) parse<char>(fin);
else if( bom.is_utf8() ) parse<char>(fin); // In my case there's no need to handle multi-byte chars
else if( bom.is_utf16le() ) parse<char16_t,true>(fin);
else if( bom.is_utf16be() ) parse<char16_t,false>(fin);
else if( bom.is_utf32le() ) parse<char32_t,true>(fin);
else if( bom.is_utf32be() ) parse<char32_t,false>(fin);
else throw std::runtime_error("Unrecognized BOM");
Now, this solution has some quirks (I can't use string literals directly in parse).
My question is whether there are alternative approaches to this problem,
maybe using existing or upcoming standard facilities that I'm not aware of.
In C++17 we gained type-safe unions (std::variant). Together with std::visit, these can be used to map runtime state to compile-time state.
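#include <cstddef> // 'std::size_t'
#include <type_traits> // 'std::integral_constant', 'std::decay_t'
#include <utility> // 'std::index_sequence'
#include <variant> // 'std::variant', 'std::visit'
template<auto x>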
using constant_t = std::integral_constant<std::decay_t<decltype(x)>, x>;
template<auto x>
constexpr constant_t<x> constant = {};
template<auto...Xs>
using variant_enum_t = std::variant< constant_t<Xs>... >;
enum class EBom {
None,
utf8,
utf16le,
utf16be,
utf32le,
utf32be,
count,
};
// you could use the existence of EBom::count and the
// assumption of contiguous indexes to automate this as well:
using VEBom = variant_enum_t< EBom::None, EBom::utf8, EBom::utf16le, EBom::utf16be, EBom::utf32le, EBom::utf32be >;
template<std::size_t...Is>
constexpr VEBom make_ve_bom( EBom bom, std::index_sequence<Is...> ) {
constexpr VEBom retvals[] = { // not 'static': static locals in constexpr functions need C++23
constant<static_cast<EBom>(Is)>...
};
return retvals[ static_cast<std::size_t>(bom) ];
}
constexpr VEBom make_ve_bom( EBom bom ) {
return make_ve_bom( bom, std::make_index_sequence< static_cast<std::size_t>(EBom::count) >{} );
}
And now, with a runtime EBom value, we can produce a VEBom.
With that VEBom we can get at the type at compile time. Suppose you have traits, like:
template<EBom>
constexpr bool bom_is_bigendian_v = ???;
template<EBom>
using bom_chartype_t = ???;
you can now write code like:
std::visit( [&](auto bom) {
bom_chartype_t<bom> next = ???;
if constexpr (bom_is_bigendian_v<bom>) {
// swizzle
}
}, vebom ); // the visitor comes first, then the variant
etc.
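The two traits are left as ??? above. One possible way to fill them in, as a sketch only, is shown below; the choice of std::conditional_t and the helper code_unit_size are assumptions made for illustration, and the block repeats EBom and constant_t so that it stands alone.

```cpp
#include <cstddef>
#include <type_traits>
#include <variant>

enum class EBom { None, utf8, utf16le, utf16be, utf32le, utf32be, count };

template<auto x>
using constant_t = std::integral_constant<std::decay_t<decltype(x)>, x>;

using VEBom = std::variant<
    constant_t<EBom::None>,    constant_t<EBom::utf8>,
    constant_t<EBom::utf16le>, constant_t<EBom::utf16be>,
    constant_t<EBom::utf32le>, constant_t<EBom::utf32be>>;

// Byte order implied by each BOM; single-byte encodings report false.
template<EBom B>
constexpr bool bom_is_bigendian_v = (B == EBom::utf16be || B == EBom::utf32be);

// Code-unit type implied by each BOM.
template<EBom B>
using bom_chartype_t =
    std::conditional_t<(B == EBom::utf16le || B == EBom::utf16be), char16_t,
    std::conditional_t<(B == EBom::utf32le || B == EBom::utf32be), char32_t,
    char>>;

static_assert(std::is_same_v<bom_chartype_t<EBom::utf8>, char>);
static_assert(std::is_same_v<bom_chartype_t<EBom::utf16le>, char16_t>);
static_assert(bom_is_bigendian_v<EBom::utf32be>);
static_assert(!bom_is_bigendian_v<EBom::utf16le>);

// Hypothetical helper: size in bytes of one code unit for the held encoding.
inline std::size_t code_unit_size(const VEBom& v)
{
    return std::visit([](auto bom) -> std::size_t {
        // 'bom' is a stateless integral_constant taken by value, so its
        // conversion to EBom is usable as a template argument here.
        return sizeof(bom_chartype_t<bom>);
    }, v);
}
```

Specializing a class template per enumerator would work just as well; std::conditional_t merely keeps the sketch short.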
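Your non-DRY ReadChar specializations become DRY with a simple rewrite:
template<bool LE, class char_t> class ReadChar {
public:
std::istream& operator()(char_t& c, std::istream& in)
{
unsigned char buf[sizeof(char_t)];
if (in.read(reinterpret_cast<char*>(buf), sizeof(char_t)))
{
c = 0;
for (std::size_t i = 0; i < sizeof(char_t); ++i) // assemble as little endian
c |= static_cast<char_t>(buf[i]) << (8*i);
if constexpr(!LE)
reverse_bytes(&c); // byte-swap helper, not shown
}
return in;
}
};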
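The reverse_bytes helper used above is not defined in this answer. A minimal sketch of what it could be, assuming it takes a pointer to a trivially copyable value and swaps its bytes in place:

```cpp
#include <algorithm>
#include <cstring>
#include <type_traits>

// Reverses the byte order of '*value' in place.
// Going through memcpy avoids aliasing problems with arbitrary T.
template<class T>
void reverse_bytes(T* value)
{
    static_assert(std::is_trivially_copyable_v<T>,
                  "reverse_bytes needs a trivially copyable type");
    unsigned char bytes[sizeof(T)];
    std::memcpy(bytes, value, sizeof(T));
    std::reverse(bytes, bytes + sizeof(T));
    std::memcpy(value, bytes, sizeof(T));
}
```

For a one-byte type such as char this is a no-op, so the same template covers every case in the dispatch.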
Your boilerplate becomes:
std::ifstream fin("/path/to/file", std::ios::binary);
auto bom = check_bom(fin); // 'check_bom' function is quite trivial
if (bom.invalid())
throw std::runtime_error("Unrecognized BOM");
auto vebom = make_ve_bom( bom.getEnum() );
std::visit( [&]( auto ebom ) {
parse<bom_chartype_t<ebom>, !bom_is_bigendian_v<ebom>>( fin );
}, vebom );
and the magic is done elsewhere.
The magic here is that the std::variant holds a bunch of std::integral_constants, each of which is both stateless and knows (in its type) what its value is.
So the only state in the std::variant is which of the stateless enum values it contains.
std::visit proceeds to call the passed-in lambda with whichever stateless std::integral_constant is in the std::variant. Within that lambda, we can use its value as a compile-time constant, like we would with any other std::integral_constant.
The runtime state of the std::variant is actually the value of the EBom because of how we set it up, so converting an EBom to a VEBom is literally copying the number over (so, free). The magic is in std::visit, which automates writing the switch statement and injecting the compile-time (integral constant) value for each of the possibilities into your code.
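To make the "automated switch statement" concrete, here is roughly what one would write by hand without std::variant and std::visit. This is an illustrative sketch, not what the library actually generates; dispatch_bom, bom_constant, and bom_size are hypothetical names.

```cpp
#include <cstddef>
#include <stdexcept>
#include <type_traits>

enum class EBom { None, utf8, utf16le, utf16be, utf32le, utf32be, count };

template<EBom B>
using bom_constant = std::integral_constant<EBom, B>;

// Calls 'f' with the integral_constant matching the runtime value 'bom'.
// Every call to 'f' must return the same type.
template<class F>
decltype(auto) dispatch_bom(EBom bom, F&& f)
{
    switch (bom) {
    case EBom::None:    return f(bom_constant<EBom::None>{});
    case EBom::utf8:    return f(bom_constant<EBom::utf8>{});
    case EBom::utf16le: return f(bom_constant<EBom::utf16le>{});
    case EBom::utf16be: return f(bom_constant<EBom::utf16be>{});
    case EBom::utf32le: return f(bom_constant<EBom::utf32le>{});
    case EBom::utf32be: return f(bom_constant<EBom::utf32be>{});
    default: throw std::invalid_argument("not a valid EBom value");
    }
}

// Usage: inside the lambda the enumerator is a compile-time constant.
inline std::size_t bom_size(EBom bom)
{
    return dispatch_bom(bom, [](auto b) -> std::size_t {
        constexpr EBom e = decltype(b)::value;
        if constexpr (e == EBom::None) return 0;
        else if constexpr (e == EBom::utf8) return 3;
        else if constexpr (e == EBom::utf16le || e == EBom::utf16be) return 2;
        else return 4;
    });
}
```

The variant-based version trades this hand-maintained switch for a list of enumerators in the variant type.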
None of this is C++23. Most of it is C++17; I may have used a C++20 feature in there as well.
The above code has not been compiled; it is just written. It probably contains typos, but the technique is sound.
--
We can automate the making of the variant type:
template<class Enum, std::size_t...Is, class VEnum=variant_enum_t<
static_cast<Enum>(Is)...
>>
constexpr VEnum make_venum( Enum e, std::index_sequence<Is...> ) {
constexpr VEnum retvals[] = { // not 'static': static locals in constexpr functions need C++23
constant<static_cast<Enum>(Is)>...
};
return retvals[ static_cast<std::size_t>(e) ];
}
template<class Enum>
constexpr auto make_venum( Enum e ) {
return make_venum( e, std::make_index_sequence< static_cast<std::size_t>(Enum::count) >{} );
}
template<class Enum>
using venum_t = decltype(make_venum( static_cast<Enum>(0) ));
Now our VEBom is just:
using VEBom = venum_t<EBom>;
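As a sanity check of the claim that the variant's runtime state is just the enum value, the following compact restatement of the helpers adds a few compile-time checks. It is a sketch under two assumptions: the enumerators are contiguous from 0 with a trailing count, and the lookup table is a non-static constexpr local so the code stays within C++17/C++20.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>
#include <variant>

template<auto x>
using constant_t = std::integral_constant<std::decay_t<decltype(x)>, x>;
template<auto... Xs>
using variant_enum_t = std::variant<constant_t<Xs>...>;

template<class Enum, std::size_t... Is>
constexpr auto make_venum(Enum e, std::index_sequence<Is...>)
{
    using VEnum = variant_enum_t<static_cast<Enum>(Is)...>;
    // One variant per enumerator, each holding the matching constant.
    constexpr VEnum table[] = { constant_t<static_cast<Enum>(Is)>{}... };
    return table[static_cast<std::size_t>(e)];
}
template<class Enum>
constexpr auto make_venum(Enum e)
{
    return make_venum(e,
        std::make_index_sequence<static_cast<std::size_t>(Enum::count)>{});
}
template<class Enum>
using venum_t = decltype(make_venum(static_cast<Enum>(0)));

enum class EBom { None, utf8, utf16le, utf16be, utf32le, utf32be, count };
using VEBom = venum_t<EBom>;

// One alternative per enumerator (excluding 'count') ...
static_assert(std::variant_size_v<VEBom> == 6);
static_assert(std::is_same_v<std::variant_alternative_t<2, VEBom>,
                             constant_t<EBom::utf16le>>);
// ... and the active index equals the enumerator's numeric value.
static_assert(make_venum(EBom::utf32le).index() == 4);
```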
Anyhow, a live example with typos fixed.