Parse HTML escape sequences with Boost.Spirit


I am trying to parse text that contains HTML escape sequences and want to replace these escapes with their UTF-8 equivalents:

&nbsp; - 0xC2A0 UTF-8 representation
&shy; - 0xC2AD UTF-8 representation

I have this grammar to do it:

template <typename Iterator>
struct HTMLEscape_grammar : qi::grammar<Iterator, std::string()>
{
    HTMLEscape_grammar() :
        HTMLEscape_grammar::base_type(text)
    {
        htmlescapes.add("&nbsp;", 0xC2AD);
        htmlescapes.add("&shy;", 0xC2AD);

        text = +((+(qi::char_ - htmlescapes)) | htmlescapes);
    }

private:
    qi::symbols<char, uint32_t> htmlescapes;
    qi::rule<Iterator, std::string()> text;
};

But when I parse:

std::string l_test = "test&shy;as test simple&shy;test";
HTMLEscape_grammar<std::string::const_iterator> l_gramar;

std::string l_ast;
bool result = qi::parse(l_test.begin(), l_test.end(), l_gramar, l_ast);

I don't get a UTF-8 string: the 0xC2 part of each UTF-8 sequence is simply cut off, and I end up with a plain single-byte string. This parser is a building block of a larger system, so UTF-8 output is required.


Solution

  • I don't know why you expect that exposing a uint32_t will magically output a Unicode code point, let alone that something will magically perform the UTF-8 encoding for it.

    Now let me get this straight. Read as a code point, 0xC2AD is U+C2AD, so you desire to have selected HTML entity references replaced by 슭 (HANGUL SYLLABLE SEULG). In UTF-8 that would be 0xEC 0x8A 0xAD.

    Just do the encoding yourself (you're composing an output stream of UTF-8 code units anyway):

    #include <boost/spirit/include/qi.hpp>
    #include <iostream>
    #include <iomanip>
    #include <string>
    #include <vector>
    
    namespace qi = boost::spirit::qi;
    
    template <typename Iterator>
    struct HTMLEscape_grammar : qi::grammar<Iterator, std::string()>
    {
        HTMLEscape_grammar() :
            HTMLEscape_grammar::base_type(text)
        {
            htmlescapes.add("&nbsp;", { '\xEC', '\x8A', '\xAD' });
            htmlescapes.add("&shy;",  { '\xEC', '\x8A', '\xAD' });
    
            text = *(htmlescapes | qi::char_);
        }
    
    private:
        qi::symbols<char, std::vector<char> > htmlescapes;
        qi::rule<Iterator, std::string()> text;
    };
    
    int main() {
        std::string const l_test = "test&shy;as test simple&shy;test";
        HTMLEscape_grammar<std::string::const_iterator> l_gramar;
    
        std::string l_ast;
        bool result = qi::parse(l_test.begin(), l_test.end(), l_gramar, l_ast);
    
        if (result) {
            std::cout << "Parse success\n";
            for (unsigned char ch : l_ast)
                std::cout << std::setw(2) << std::setfill('0') << std::hex << std::showbase << static_cast<int>(ch) << " ";
        } else {
            std::cout << "Parse failure\n";
        }
    }
    

    Prints

    Parse success
    0x74 0x65 0x73 0x74 0xec 0x8a 0xad 0x61 0x73 0x20 0x74 0x65 0x73 0x74 0x20 0x73 0x69 0x6d 0x70 0x6c 0x65 0xec 0x8a 0xad 0x74 0x65 0x73 0x74
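
The byte sequence 0xEC 0x8A 0xAD quoted above follows from the standard UTF-8 bit layout. As a minimal sketch of that arithmetic, here is a hand-written encoder; `encode_utf8` is a hypothetical helper name chosen for illustration and is not part of Boost.Spirit. It does not reject surrogate code points and returns an empty string for values above U+10FFFF.

```cpp
#include <string>

// Hypothetical helper (not a Boost.Spirit API): encode a single Unicode
// code point as a sequence of UTF-8 code units.
// Assumption: surrogates (U+D800..U+DFFF) are not rejected in this sketch;
// values above U+10FFFF yield an empty string.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this sketch, `encode_utf8(0xC2AD)` yields the three bytes 0xEC 0x8A 0xAD used in the grammar above, while `encode_utf8(0xA0)` and `encode_utf8(0xAD)` yield 0xC2 0xA0 and 0xC2 0xAD, the sequences listed in the question for the non-breaking space and the soft hyphen. The resulting bytes could be stored as the symbol table values in place of the hard-coded initializer lists.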