With Boost 1.60 I could use #define BOOST_SPIRIT_UNICODE and boost::spirit::unicode::char_ to process UTF-8 input strings without any further preprocessing. With Boost 1.72 this fails with an exception.
The solution seems to be to use boost::u8_to_u32_iterator and let Spirit work with wide strings. But why did it work so flawlessly in the earlier version and, if possible, how can I reactivate the old behavior?
Here is some sample code:
#define BOOST_SPIRIT_UNICODE
#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>

int main()
{
    typedef std::string::const_iterator iterator_type;
    namespace qi = boost::spirit::qi;
    namespace unicode = boost::spirit::unicode;

    // UTF-8 encoded input containing a non-ASCII character (U+23F3)
    std::string input("\"Test ⏳\"");

    // A double-quoted string: any characters except '"' between two quotes
    qi::rule<iterator_type, std::string(), unicode::space_type> quoted_string =
        qi::lexeme['"' >> +(unicode::char_ - '"') >> '"'];

    iterator_type iter = input.begin();
    iterator_type end = input.end();
    std::string output;
    bool r = phrase_parse(iter, end, quoted_string, unicode::space, output);

    // Success requires a match that also consumed the whole input
    if (r && iter == end)
        std::cout << "successfully parsed " << input << " to " << output << std::endl;
    else
        std::cout << "failed to parse " << input << std::endl;
    return 0;
}
Running this on my local box with Boost 1.65.1 parses successfully, and without tripping ASAN/UBSAN.
I bisected the commits in the Git repo for Spirit and found the first breakage at the tag for 1.72.0 (SPIRIT_VERSION 0x2058).
The commit that breaks it is:
commit 16159fb335c9bb2040cf061e30fdd4deea9087e1 (HEAD)
Author: djowel <djowel@gmail.com>
Date: Mon Aug 26 10:15:05 2019 +0800
add invalid ascii tests + fix
That seems to have (unintentionally) regressed this case, because the input was not in fact ASCII. I would file a bug with this analysis at the Boost Spirit repo.
If it is of any use to you: using Boost 1.76.0 with 16159fb335c9 reverted works fine.
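To make the "not in fact ASCII" point concrete, here is a small standard-library-only sketch. It does not reproduce Spirit's internals. It assumes only that the regression comes from bytes above 0x7F being treated as invalid, which I have not verified beyond the bisect. The helpers is_ascii and decode_utf8 are hypothetical names for illustration, and decode_utf8 is a conceptual stand-in for the kind of conversion boost::u8_to_u32_iterator performs, without that iterator's validation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: true if every byte is 7-bit ASCII (0x00..0x7F).
bool is_ascii(const std::string& s)
{
    for (unsigned char c : s)
        if (c > 0x7F)
            return false;
    return true;
}

// Hypothetical helper: decode UTF-8 into code points.
// Minimal sketch that assumes well-formed input (no error handling).
std::vector<std::uint32_t> decode_utf8(const std::string& s)
{
    std::vector<std::uint32_t> out;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        // Sequence length is determined by the lead byte's high bits.
        std::size_t len = b < 0x80 ? 1 : (b >> 5) == 0x6 ? 2 : (b >> 4) == 0xE ? 3 : 4;
        // Keep only the payload bits of the lead byte.
        std::uint32_t cp = len == 1 ? b : b & (0xFF >> (len + 1));
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

For the question's input, written with explicit byte escapes as "\"Test \xE2\x8F\xB3\"", is_ascii returns false because the hourglass occupies the three bytes E2 8F B3. decode_utf8 yields 8 code points, the seventh being U+23F3. A parser that sees the raw std::string iterates over those 10 bytes one at a time, while one fed decoded code points sees 8 characters.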