
Spirit X3, ascii::cntrl why disparity with std::iscntrl?


I'm concentrating on checking for error conditions in a parser design using Spirit X3. One set of checks uses the character classification parsers, the counterparts of isalpha or ispunct. According to the X3 documentation (Character Parsers), they should match what C++ provides as std::isalpha and std::ispunct. However, the demonstration code below gives me different results.

#include <cstddef>
#include <cstdio>
#include <cstdint>
#include <cctype>
#include <iostream>
#include <boost/spirit/home/x3/version.hpp>
#include <boost/spirit/home/x3.hpp>

namespace client::parser
{
  namespace x3 = boost::spirit::x3;
  namespace ascii = boost::spirit::x3::ascii;

  using ascii::char_;
  using ascii::space;
  using x3::skip;

  x3::rule<class main_rule_id, char> const main_rule_ = "main_rule";
  const auto main_rule__def = ascii::cntrl;

  BOOST_SPIRIT_DEFINE( main_rule_ ) 
  const auto entry_point = skip(space) [ main_rule_ ];
}

int main()
{
  printf( "Spirit X3 version: %4.4x\n", SPIRIT_X3_VERSION );

  char output;

  bool r = false;
  bool r2 = false; // answer according to default "C" locale
  char input[2];
  input[1] = 0;

  printf( "ascii::cntrl\n" );

  uint8_t i = 0;
  next_char:  
    input[0] = (char)i;
    r = parse( (char*)input, input+1, client::parser::entry_point, output );
    r2 = (bool)std::iscntrl( (unsigned char)i );
    printf( "%2.2x:%d%d", i, r, r2 );
    if ( i == 0x7f ) { goto exit_loop; }
    ++i;
    if ( i % 8 ) { putchar( ' ' ); } else { putchar( '\n' ); }
    goto next_char;
  exit_loop:

  return 0;
}

The output is:

Spirit X3 version: 3004
ascii::cntrl
00:11 01:11 02:11 03:11 04:11 05:11 06:11 07:11
08:11 09:01 0a:01 0b:01 0c:01 0d:01 0e:11 0f:11
10:11 11:11 12:11 13:11 14:11 15:11 16:11 17:11
18:11 19:11 1a:11 1b:11 1c:11 1d:11 1e:11 1f:11
20:00 21:00 22:00 23:00 24:00 25:00 26:00 27:00
28:00 29:00 2a:00 2b:00 2c:00 2d:00 2e:00 2f:00
30:00 31:00 32:00 33:00 34:00 35:00 36:00 37:00
38:00 39:00 3a:00 3b:00 3c:00 3d:00 3e:00 3f:00
40:00 41:00 42:00 43:00 44:00 45:00 46:00 47:00
48:00 49:00 4a:00 4b:00 4c:00 4d:00 4e:00 4f:00
50:00 51:00 52:00 53:00 54:00 55:00 56:00 57:00
58:00 59:00 5a:00 5b:00 5c:00 5d:00 5e:00 5f:00
60:00 61:00 62:00 63:00 64:00 65:00 66:00 67:00
68:00 69:00 6a:00 6b:00 6c:00 6d:00 6e:00 6f:00
70:00 71:00 72:00 73:00 74:00 75:00 76:00 77:00
78:00 79:00 7a:00 7b:00 7c:00 7d:00 7e:00 7f:11

The first digit after the colon is the answer according to X3, and the second digit is the answer according to C++. The mismatch occurs exactly on the characters that also fall into the isspace category. I have been digging through the library headers lately, but I still haven't found anything that explains this behavior.

Why the disparity? Have I missed something?

Oh yeah, I love my goto statements. And my retro C style. I hope you do too! Even for an X3 parser.


Solution

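  • The skipper is the culprit: skip(space) consumes any whitespace before ascii::cntrl gets a chance to see it. The characters 0x09 through 0x0d are both whitespace and control characters, so the skipper eats them, cntrl is left with empty input, and the parse fails.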

    As a note about style, I see no good reason to

    • use C-style casts (they're dangerous)
    • write a loop with goto (considered harmful)
    • use cryptic variable names (r, r2?)

    I simplified the parser by dropping the skipper, and now X3 agrees with std::iscntrl for every character:

    #include <boost/spirit/home/x3/version.hpp>
    #include <boost/spirit/home/x3.hpp>
    #include <cctype>
    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <iomanip>
    
    namespace client::parser {
        using namespace boost::spirit::x3;
        //const auto entry_point = skip(space)[ ascii::cntrl ];
        const auto entry_point = ascii::cntrl;
    }
    
    int main() {
        std::cout << std::boolalpha << std::hex << std::setfill('0');
        std::cout << "Spirit X3 version: " << SPIRIT_X3_VERSION << "\n";
    
        for (uint8_t i = 0; i <= 0x7f; ++i) {
            auto from_x3  = parse(&i, &i + 1, client::parser::entry_point);
            auto from_std = !!std::iscntrl(i);
    
            if (from_x3 != from_std) {
                std::cout << "0x" << std::setw(2) << static_cast<unsigned>(i) << "\tx3:" << from_x3 << "\tstd:" << from_std << '\n';
            }
        }
    
        std::cout << "Done\n";
    }
    

    This prints simply:

    Spirit X3 version: 3000
    Done
    

    With the skip(space) line (commented out above) enabled instead, the mismatches come back:

    Spirit X3 version: 3000
    0x09    x3:false    std:true
    0x0a    x3:false    std:true
    0x0b    x3:false    std:true
    0x0c    x3:false    std:true
    0x0d    x3:false    std:true
    Done