Parse html escape sequence with boost spirit

I am trying to parse text containing HTML escape sequences and want to replace these escapes with their UTF-8 equivalents:

&nbsp; - 0xC2A0 UTF-8 representation
&shy; - 0xC2AD UTF-8 representation

I have this grammar to solve it:

template <typename Iterator>
struct HTMLEscape_grammar : qi::grammar<Iterator, std::string()>
{
    HTMLEscape_grammar() :
        HTMLEscape_grammar::base_type(text)
    {
        htmlescapes.add("&nbsp;", 0xC2AD);
        htmlescapes.add("&shy;", 0xC2AD);

        text = +((+(qi::char_ - htmlescapes)) | htmlescapes);
    }

    qi::symbols<char, uint32_t> htmlescapes;
    qi::rule<Iterator, std::string()> text;
};

But when I parse the following input:

std::string l_test = "test&shy;as test simple&shy;test";
HTMLEscape_grammar<std::string::const_iterator> l_gramar;

std::string l_ast;
bool result = qi::parse(l_test.begin(), l_test.end(), l_gramar, l_ast);

I don't get a UTF-8 string: the 0xC2 part of each UTF-8 sequence is simply cut off, and I end up with a plain ASCII string. This parser is a building block of a larger system, so UTF-8 output is required.


  • I don't know how you suppose that exposing a uint32_t will magically output a UNICODE codepoint. Let alone that something will magically perform UTF8 encoding for that.

    Now let me get this straight. You desire to have selected HTML entity references replaced by 슭 (HANGUL SYLLABLE SEULG). In UTF-8 that would be 0xEC 0x8A 0xAD.

    Just do the encoding yourself (you're composing an output stream of UTF8 code units anyways):

    #include <boost/spirit/include/qi.hpp>
    #include <iomanip>
    #include <iostream>
    #include <string>
    #include <vector>

    namespace qi = boost::spirit::qi;

    template <typename Iterator>
    struct HTMLEscape_grammar : qi::grammar<Iterator, std::string()>
    {
        HTMLEscape_grammar() :
            HTMLEscape_grammar::base_type(text)
        {
            // Each entity maps directly to the UTF-8 code units of its replacement
            htmlescapes.add("&nbsp;", { '\xEC', '\x8A', '\xAD' });
            htmlescapes.add("&shy;",  { '\xEC', '\x8A', '\xAD' });

            // Try an entity first; otherwise copy a single character through
            text = *(htmlescapes | qi::char_);
        }

        qi::symbols<char, std::vector<char> > htmlescapes;
        qi::rule<Iterator, std::string()> text;
    };

    int main() {
        std::string const l_test = "test&shy;as test simple&shy;test";
        HTMLEscape_grammar<std::string::const_iterator> l_gramar;

        std::string l_ast;
        bool result = qi::parse(l_test.begin(), l_test.end(), l_gramar, l_ast);

        if (result) {
            std::cout << "Parse success\n";
            // Dump every output byte in hex
            for (unsigned char ch : l_ast)
                std::cout << std::setw(2) << std::setfill('0') << std::hex << std::showbase << static_cast<int>(ch) << " ";
        } else {
            std::cout << "Parse failure\n";
        }
    }

    Prints:

    Parse success
    0x74 0x65 0x73 0x74 0xec 0x8a 0xad 0x61 0x73 0x20 0x74 0x65 0x73 0x74 0x20 0x73 0x69 0x6d 0x70 0x6c 0x65 0xec 0x8a 0xad 0x74 0x65 0x73 0x74
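
The byte values used in the answer can be checked by hand or with a small encoder. The following is a minimal sketch using only the standard library, not part of Boost.Spirit; `utf8_encode` and `narrow_to_char` are hypothetical helper names introduced here for illustration. The first turns a code point into its UTF-8 code units. The second shows what is left when a 32-bit value is squeezed into a single `char`, which is consistent with the lost 0xC2 described in the question.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 code units.
// Returns an empty string for surrogates and for values above U+10FFFF.
std::string utf8_encode(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return out; // surrogates are not encodable
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: store a 32-bit value in a one-char string.
// Only the low 8 bits survive the conversion.
std::string narrow_to_char(std::uint32_t value) {
    return std::string(1, static_cast<char>(static_cast<unsigned char>(value)));
}
```

With this sketch, `utf8_encode(0xC2AD)` gives the three bytes EC 8A AD used above, while the entities from the question correspond to `utf8_encode(0x00A0)` (bytes C2 A0 for `&nbsp;`) and `utf8_encode(0x00AD)` (bytes C2 AD for `&shy;`). `narrow_to_char(0xC2AD)` keeps only the byte AD.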