Search code examples
c++boostllvmboost-spiritboost-spirit-qi

Parse arbitrary precision numbers with Boost spirit


I would like to write a Boost Spirit Qi parser that can parse arbitrary C integer literals (e.g. 1234 or 0x1234ULL) and convert them to arbitrary precision llvm::APInt values.

I imagine that to do this I need to combine separate parsers for decimal, hexadecimal etc. literals.

Taking the latter as an example, the parser would need to recognize tokens 0xNN...NS where N are hexadecimal digits and S is a valid literal suffix.

Constructing such a parser is easy but I make it "discard" prefix and suffix and return the remaining digits converted to llvm::APInt values the same way that e.g. qi::uint_ return unsigned integers?

I know there is qi::uint_parser but that class seems very limited since it seems to build up its results from integers as opposed to strings. This is a staple feature of other parser generators so I'm surprised that the documentation glosses over this.


Solution

  • I think that what is a staple of parser generators is indeed parsing into arbitrary types of integers.

    What you are after is more: you want to parse into a type that represents arbitrary types of integers with added semantic information, based on decisions in your grammar.

    These decisions cannot be baked into the parser generator, because that would tie it to a particular type of grammars.

    Of course, you can do that, too. Let me walk through step by step.

    1. The Staple

    As you have noted, Spirit does that. Let's demonstrate the basics.

    Loosely after http://www.nongnu.org/hcb/#integer-literal

    _suffix += "u", "l", "ll", "ul", "lu", "ull", "llu";
    
    _start = qi::no_case[ // case insensitive
        ("0x"          >> qi::uint_parser<Integer, 16>{} |
         "0b"          >> qi::uint_parser<Integer, 2>{} |
         &qi::lit('0') >> qi::uint_parser<Integer, 8>{} |
                          qi::uint_parser<Integer, 10>{})
        // ignored for now:
        >> -_suffix];
    

    As you can see it parses hex, binary, octal and decimal unsigned numbers with an optional suffix. We're ignoring the suffix for now, so that we can demonstrate that it parses into generalized integral types.

    See a demo Live On Compiler Explorer

    template <typename Integer> void test() {
        std::cout << " ---- " << __PRETTY_FUNCTION__ << "\n";
        using It = std::string::const_iterator;
        IntLiteral<It, Integer> const parser {};
    
        for (std::string const input : {
                 "1234",
                 "1234u",
                 "0x12f34ULL",
                 "033ULL",
                 "0b101011l",
                 "33lu"
             }) {
            Integer value;
            if (parse(input.begin(), input.end(), parser >> qi::eoi, value)) {
                std::cout << "Parsed " << std::quoted(input) << " -> " << value << "\n";
            } else {
                std::cout << "Failed to parse " << std::quoted(input) << "\n";
            }
        }
    }
    
    int main() {
        test<std::uintmax_t>();
        test<boost::multiprecision::checked_int1024_t>();
    }
    

    Prints

     ---- void test() [with Integer = long unsigned int]
    Parsed "1234" -> 1234
    Parsed "1234u" -> 1234
    Parsed "0x12f34ULL" -> 77620
    Parsed "033ULL" -> 27
    Parsed "0b101011l" -> 43
    Parsed "33lu" -> 33
     ---- void test() [with Integer = boost::multiprecision::number<boost::multiprecision::backend
    s::cpp_int_backend<1024, 1024, boost::multiprecision::signed_magnitude, boost::multiprecision:
    :checked, void> >]
    Parsed "1234" -> 1234
    Parsed "1234u" -> 1234
    Parsed "0x12f34ULL" -> 77620
    Parsed "033ULL" -> 27
    Parsed "0b101011l" -> 43
    Parsed "33lu" -> 33
    

    2. Variant Type

    Now, you actually want the result to reflect the literal's type.

    You can do that without LLVM. E.g. by parsing into intmax_t first, and then coercing to the appropriate type based on the suffix.

    Let's parse into

    using CxxInteger = boost::variant<
        signed, unsigned, 
        signed long, unsigned long,
        signed long long, unsigned long long>;
    

    Then parsing with:

    using Raw = std::uintmax_t;
    
    _start = no_case [ // case insensitive
        ("0x"       >> uint_parser<Raw, 16>{} |
         "0b"       >> uint_parser<Raw,  2>{} |
         &lit('0')  >> uint_parser<Raw,  8>{} |
                       uint_parser<Raw, 10>{})
        // ignored for now:
        >> _optsuffix
    ] [ _val = coerce_type(_1, _2) ];
    
    _optsuffix = no_case[_suffix] | attr(Suffix::signed_);
    

    Now, we have to write the semantic rules that apply to our grammar:

    struct converter_f {
        CxxInteger operator()(uintmax_t raw, Suffix sfx) const {
            switch (sfx) {
              case Suffix::signed_:   return static_cast<signed>(raw);
              case Suffix::unsigned_: return static_cast<unsigned>(raw);
              case Suffix::long_:     return static_cast<long>(raw);
              case Suffix::longlong_: return static_cast<long long>(raw);
              case Suffix::ul_:       return static_cast<unsigned long>(raw);
              case Suffix::ull_:      return static_cast<unsigned long long>(raw);
            }
            throw std::invalid_argument("sfx");
        }
    };
    boost::phoenix::function<converter_f> coerce_type;
    

    That's it. We can now parse the same test cases Live On Compiler Explorer

    std::cout << "Parsed " << std::quoted(input) << " -> " << value
              << " (type #" << value.which() << " "
              << boost::core::demangle(value.type().name()) << ")\n";
    

    Prints

     ---- void test()
    Parsed "1234" -> 1234 (type #0 int)
    Parsed "1234u" -> 1234 (type #1 unsigned int)
    Parsed "0x12f34ULL" -> 77620 (type #5 unsigned long long)
    Parsed "033ULL" -> 27 (type #5 unsigned long long)
    Parsed "0b101011l" -> 43 (type #2 long)
    Parsed "33lu" -> 33 (type #3 unsigned long)
    

    3. Applying To LLVM APInt

    The mechanics are the same:

    struct converter_f {
        template <typename T> static auto as(uint64_t raw) {
            return llvm::APInt(raw, CHAR_BIT * sizeof(T), std::is_signed_v<T>);
        }
        llvm::APInt operator()(uintmax_t raw, Suffix sfx) const {
            switch (sfx) {
            case Suffix::signed_:   return as<signed>(raw);
            case Suffix::unsigned_: return as<unsigned>(raw);
            case Suffix::long_:     return as<long>(raw);
            case Suffix::longlong_: return as<long long>(raw);
            case Suffix::ul_:       return as<unsigned long>(raw);
            case Suffix::ull_:      return as<unsigned long long>(raw);
            }
            throw std::invalid_argument("sfx");
        }
    };
    

    Full Demo

    "Live" On Compiler Explorer

    (compiler explorer doesn't support linking to LLVM)

    #include <boost/spirit/include/qi.hpp>
    #include <boost/spirit/include/phoenix.hpp>
    #include <iomanip>
    #include <llvm/ADT/APInt.h>
    namespace qi = boost::spirit::qi;
    
    template <typename It>
    struct IntLiteral : qi::grammar<It, llvm::APInt()> {
        IntLiteral() : IntLiteral::base_type(_start) {
            using namespace qi;
            using Raw = std::uint64_t;
    
            _start = no_case [ // case insensitive
                ("0x"      >> uint_parser<Raw, 16>{} |
                "0b"      >> uint_parser<Raw,  2>{} |
                &lit('0') >> uint_parser<Raw,  8>{} |
                            uint_parser<Raw, 10>{})
                // ignored for now:
                >> _optsuffix
            ] [ _val = coerce_type(_1, _2) ];
    
            _optsuffix = no_case[_suffix] | attr(Suffix::signed_);
        }
    
    private:
        enum class Suffix {
            signed_   = 0,
            unsigned_ = 1,
            long_     = 2,
            longlong_ = 4,
    
            l_   = long_,
            ll_  = longlong_,
            ul_  = unsigned_ | l_,
            ull_ = unsigned_ | ll_,
        };
    
        struct suffix_sym : qi::symbols<char, Suffix> {
            suffix_sym() {
                this->add
                    ("u",   Suffix::unsigned_)
                    ("l",   Suffix::l_)
                    ("ll",  Suffix::ll_)
                    ("ul",  Suffix::ul_)  ("lu",  Suffix::ul_)
                    ("ull", Suffix::ull_) ("llu", Suffix::ull_)
                ;
            }
        } _suffix;
    
        struct converter_f {
            template <typename T> static auto as(uint64_t raw) {
                return llvm::APInt(CHAR_BIT * sizeof(T), raw, std::is_signed_v<T>);
            }
            llvm::APInt operator()(uint64_t raw, Suffix sfx) const {
                switch (sfx) {
                case Suffix::signed_:   return as<signed>(raw);
                case Suffix::unsigned_: return as<unsigned>(raw);
                case Suffix::long_:     return as<long>(raw);
                case Suffix::longlong_: return as<long long>(raw);
                case Suffix::ul_:       return as<unsigned long>(raw);
                case Suffix::ull_:      return as<unsigned long long>(raw);
                }
                throw std::invalid_argument("sfx");
            }
        };
        boost::phoenix::function<converter_f> coerce_type;
    
        qi::rule<It, llvm::APInt()> _start;
        qi::rule<It, Suffix()>      _optsuffix;
    };
    
    void test() {
        std::cout << " ---- " << __PRETTY_FUNCTION__ << "\n";
        using It = std::string::const_iterator;
        IntLiteral<It> const parser {};
    
        for (std::string const input : {
                "1234",
                "1234u",
                "0x12f34ULL",
                "033ULL",
                "0b101011l",
                "33lu"
            }) {
            llvm::APInt value;
            if (parse(input.begin(), input.end(), parser >> qi::eoi, value)) {
                std::cout << "Parsed " << std::quoted(input) << " -> "
                        << value.toString(10, false) // TODO signed?
                        << " bits:" << value.getBitWidth() << "\n";
            } else {
                std::cout << "Failed to parse " << std::quoted(input) << "\n";
            }
        }
    }
    
    int main() {
        test();
    }
    

    Prints

     ---- void test()
    Parsed "1234" -> 1234 bits:32
    Parsed "1234u" -> 1234 bits:32
    Parsed "0x12f34ULL" -> 77620 bits:64
    Parsed "033ULL" -> 27 bits:64
    Parsed "0b101011l" -> 43 bits:64
    Parsed "33lu" -> 33 bits:64
    

    Remaining Loose Ends

    • Of course, with semantic actions you can in fact parse the string representation using the fromString factory method

    • I don't know how to accurate ask APInt whether it is signed. I suspect I should have been parsing into a variant<APInt, APSInt> to retain that information

    • I didn't put work into detecting overflows. The first example should have that out-of-the-box (thanks to Qi)

    • I also didn't put work into supporting c++14 digit separators because it wasn't specified. And it didn't seem part of any "staple" feature anyways.