Tags: c++, boost, unicode, boost-spirit-qi

Why does boost::spirit::unicode::char_ no longer work with UTF-8 char* strings?


With Boost 1.60 I could #define BOOST_SPIRIT_UNICODE and use boost::spirit::unicode::char_ to process UTF-8 input strings without any further preprocessing. With Boost 1.72 this fails with an exception.

The solution seems to be to use boost::u8_to_u32_iterator and let Spirit work with wide strings. But why did it work so flawlessly in the earlier version, and, if possible, how can I restore the old behavior?

Here is some sample code:

#define BOOST_SPIRIT_UNICODE
#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>

int main()
{
   typedef std::string::const_iterator iterator_type;
   namespace qi = boost::spirit::qi;
   namespace unicode = boost::spirit::unicode;

   std::string input("\"Test ⏳\"");
   qi::rule<iterator_type, std::string(), unicode::space_type> quoted_string = qi::lexeme['"' >> +(unicode::char_ - '"') >> '"'];

   iterator_type iter = input.begin();
   iterator_type end = input.end();
   std::string output;
   bool r = phrase_parse(iter, end, quoted_string, unicode::space, output);

   if (r && iter == end)
      std::cout << "successfully parsed " << input << " to " << output << std::endl;
   else
      std::cout << "failed to parse " << input << std::endl;

   return 0;
}
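
For context, the u8_to_u32_iterator workaround mentioned above decodes the UTF-8 bytes into 32-bit code points before the parser sees them. The following is a minimal sketch of that decoding step in plain C++, assuming well-formed input. It is not Boost's implementation: decode_utf8 is a hypothetical helper name, and it does not reject overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of Boost): decode a UTF-8 byte string
// into a sequence of Unicode code points.
std::vector<std::uint32_t> decode_utf8(const std::string& s)
{
    std::vector<std::uint32_t> out;
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;   // total bytes in this sequence
        std::uint32_t cp = 0;  // code point being assembled

        // The lead byte tells us how many bytes the sequence occupies.
        if (lead < 0x80)              { len = 1; cp = lead; }
        else if ((lead >> 5) == 0x06) { len = 2; cp = lead & 0x1F; }
        else if ((lead >> 4) == 0x0E) { len = 3; cp = lead & 0x0F; }
        else if ((lead >> 3) == 0x1E) { len = 4; cp = lead & 0x07; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + len > s.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte has the form 10xxxxxx and adds 6 bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c >> 6) != 0x02)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

For the sample input, the three bytes 0xE2 0x8F 0xB3 of ⏳ collapse into the single code point U+23F3, so a parser working on the decoded sequence sees one character there instead of three bytes.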

Solution

  • Running this on my local box with Boost 1.65.1 parses successfully, with no apparent ASAN/UBSAN failures.

    I bisected the commits in the Spirit Git repo and found the first breakage at the tag for 1.72.0 (SPIRIT_VERSION 0x2058).

    The commit that breaks it is:

    commit 16159fb335c9bb2040cf061e30fdd4deea9087e1 (HEAD)
    Author: djowel
    Date:   Mon Aug 26 10:15:05 2019 +0800
    
        add invalid ascii tests + fix
    

    That commit seems to have (unintentionally) regressed this case, because the input here is not in fact ASCII. I would file a bug with this analysis at the Boost Spirit repo.

    If it is of any use to you: Boost 1.76.0 with 16159fb335c9 reverted works fine.
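
To illustrate the point about the input not being ASCII, here is a small sketch that counts the bytes of a string that fall outside the 7-bit range. The helper name count_non_ascii_bytes is made up for this example, and whether the commit's check works this way is an assumption, since the diff is not shown here.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count bytes that are outside the 7-bit ASCII range.
std::size_t count_non_ascii_bytes(const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char c : s) {
        if (c > 0x7F)
            ++n;
    }
    return n;
}
```

For the input from the question, the ⏳ character is encoded as three UTF-8 bytes (0xE2 0x8F 0xB3), all of which are above 0x7F, so any byte-wise ASCII validation would see three offending bytes.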