Search code examples
c++boost

c++ boost base64 decoder fails when newlines are present


when given base64 'text' to decode, that contains new lines, the following will throw an exception - non base64 characters present. In reference to the new lines.

terminate called after throwing an instance of 'boost::archive::iterators::dataflow_exception' what(): attempt to decode a value not in base64 char set

Does anyone know how to tell boost to gracefully handle newlines? I recognize I could remove them myself from the string, prior to decoding, but was hoping and guessing there was a more streamlined way.

typedef transform_width< binary_from_base64<remove_whitespace<char*>>, 8, 6 > base64_dec;
unsigned int size = s.size(); //where 's' is the string holding the base64 characters to include newlines at every 76th character
std::string decoded_token(base64_dec(s.c_str()), base64_dec(s.c_str() + size));

Solution

  • Okay, so the trouble is that newlines are not the problem. The filter_iterator is just fundamentally broken.

    It will cause Undefined Behaviour as soon as the input sequence ends in a character that doesn't satisfy the filter predicate (in this case, a whitespace character):

    Crashing Live On Compiler Explorer

    #include <boost/archive/iterators/remove_whitespace.hpp>
    #include <iomanip>
    #include <iostream>
    
    namespace bai = boost::archive::iterators;
    
    int main() {
        using It = bai::remove_whitespace<const char*>;
        std::string const s = "oops "; // ends in whitespace, causes UB
    
        std::string filtered(It(s.c_str()), It(s.c_str() + s.length()));
        std::cout << std::quoted(filtered) << std::flush;
    }
    

    Prints, with ASan enabled (without it it just segfaults):

    =================================================================
    ==1==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7fffecf132b0 at pc 0x00000040390d bp 0x7fffecf12fd0 sp 0x7fffecf12fc8
    READ of size 1 at 0x7fffecf132b0 thread T0
        #0 0x40390c in dereference_impl /opt/compiler-explorer/libs/boost_1_78_0/boost/archive/iterators/remove_whitespace.hpp:105
        #1 0x40390c in dereference /opt/compiler-explorer/libs/boost_1_78_0/boost/archive/iterators/remove_whitespace.hpp:113
        #2 0x40390c in dereference<boost::archive::iterators::filter_iterator<(anonymous namespace)::remove_whitespace_predicate<char>, char const*> > /opt/compiler-explorer/libs/boost_1_78_0/boost/iterator/iterator_facade.hpp:550
        #3 0x40390c in operator* /opt/compiler-explorer/libs/boost_1_78_0/boost/iterator/iterator_facade.hpp:656
        #4 0x40390c in void std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<boost::archive::iterators::remove_whitespace<char const*> >(boost::archive::iterators::remove_whitespace<char const*>, boost::archive::iterators::remove_whitespace<char const*>, std::input_iterator_tag) /opt/compiler-explorer/gcc-trunk-20220419/include/c++/12.0.1/bits/basic_string.tcc:204
        #5 0x40390c in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string<boost::archive::iterators::remove_whitespace<char const*>, void>(boost::archive::iterators::remove_whitespace<char const*>, boost::archive::iterators::remove_whitespace<char const*>, std::allocator<char> const&) /opt/compiler-explorer/gcc-trunk-20220419/include/c++/12.0.1/bits/basic_string.h:756
        #6 0x40390c in main /app/example.cpp:11
        #7 0x7fb625b560b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x240b2)
        #8 0x4041ed in _start (/app/output.s+0x4041ed)
    
    Address 0x7fffecf132b0 is located in stack of thread T0 at offset 576 in frame
        #0 0x40245f in main /app/example.cpp:7
    
      This frame has 21 object(s):
        [32, 33) '<unknown>'
        [48, 49) '<unknown>'
        [64, 65) '<unknown>'
        [80, 81) '<unknown>'
        [96, 97) '<unknown>'
        [112, 113) '<unknown>'
        [128, 136) 'start'
        [160, 168) 'start'
        [192, 200) '__guard'
        [224, 232) 'start'
        [256, 264) 'start'
        [288, 296) '__capacity'
        [320, 328) '__guard'
        [352, 368) '<unknown>'
        [384, 400) '<unknown>'
        [416, 432) '<unknown>'
        [448, 464) '<unknown>'
        [480, 496) '<unknown>'
        [512, 528) '<unknown>'
        [544, 576) 's' (line 9) <== Memory access at offset 576 overflows this variable
        [608, 640) 'filtered' (line 11)
    HINT: this may be a false positive if your program uses some custom stack unwind mechanism, swapcontext or vfork
          (longjmp and C++ exceptions *are* supported)
    SUMMARY: AddressSanitizer: stack-buffer-overflow /opt/compiler-explorer/libs/boost_1_78_0/boost/archive/iterators/remove_whitespace.hpp:105 in dereference_impl
    Shadow bytes around the buggy address:
      0x10007d9da600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1
      0x10007d9da610: f1 f1 f8 f2 01 f2 f8 f2 f8 f2 f8 f2 01 f2 00 f2
      0x10007d9da620: f2 f2 00 f2 f2 f2 f8 f2 f2 f2 00 f2 f2 f2 00 f2
      0x10007d9da630: f2 f2 00 f2 f2 f2 00 f2 f2 f2 00 00 f2 f2 00 00
      0x10007d9da640: f2 f2 00 00 f2 f2 00 00 f2 f2 00 00 f2 f2 00 00
    =>0x10007d9da650: f2 f2 00 00 00 00[f2]f2 f2 f2 00 00 00 00 f3 f3
      0x10007d9da660: f3 f3 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      0x10007d9da670: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      0x10007d9da680: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      0x10007d9da690: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      0x10007d9da6a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    Shadow byte legend (one shadow byte represents 8 application bytes):
      Addressable:           00
      Partially addressable: 01 02 03 04 05 06 07 
      Heap left redzone:       fa
      Freed heap region:       fd
      Stack left redzone:      f1
      Stack mid redzone:       f2
      Stack right redzone:     f3
      Stack after return:      f5
      Stack use after scope:   f8
      Global redzone:          f9
      Global init order:       f6
      Poisoned by user:        f7
      Container overflow:      fc
      Array cookie:            ac
      Intra object redzone:    bb
      ASan internal:           fe
      Left alloca redzone:     ca
      Right alloca redzone:    cb
    ==1==ABORTING
    

    You can consider yourself lucky that you noticed due to symptoms, instead of it eating your puppy or launching the nukes in production code.

    This should be reported. There is no test for filter_iterator (or even remove_whitespace for that matter) in isolation, and previous tickets seem to indicate a stance like "It Works For Me" (where "Me" is the Boost Serialization Library). See e.g. https://github.com/boostorg/serialization/issues/135.

    Interestingly the analysis at that ticket makes zero sense, since there hasn't been a two-iterator constructor for filter_iterator since ... forever. I can only guess that Robert mistakenly looked at Boost Iterator instead of the filter_iterator from Boost Archive.

    So my hunch was to recommend you use filter_iterator.hpp from Boost Iterator instead. Ironically it took several tries (and a trip to cppslack/github) to get right, but here it is: Live On Compiler Explorer

    Fix Using Boost Iterator's filter_iterator

    We should be able to fix using the working filter_iterator implementation:

    using FiltIt =
        boost::iterators::filter_iterator<IsGraph, std::string::const_iterator>;
    
    using base64_dec =                              //
        bai::transform_width<                       //
            bai::binary_from_base64<FiltIt>, 8, 6>; //
    

    Now, it's still tricky to get it right. Notably, the naive approach will just fail with UB again:

    // CAUTION: this invokes UB:
    std::string filtered(base64_dec(s.begin()), base64_dec(s.end()));
    

    It's the curse of implicit conversions + default arguments. Instead we MUST explicitly construct the FiltIt separately:

    FiltIt f(IsGraph{}, s.begin(), s.end()), // !!
        l(f.predicate(), f.end(), f.end());  // !!
    

    Now we can "just" use them in base64_dec:

    std::string filtered(base64_dec{f}, base64_dec{l});
    

    Note the uniform {} initializers to sidestep Most Vexing Parse

    Live On Compiler Explorer

    #include <boost/archive/iterators/binary_from_base64.hpp>
    #include <boost/archive/iterators/transform_width.hpp>
    #include <boost/iterator/filter_iterator.hpp>
    #include <iomanip>
    #include <iostream>
    
    namespace bai = boost::archive::iterators;
    static std::string const s = "aGVsbG8g\nd29ybGQK";
    
    int main() {
        std::cout << std::unitbuf;
    
        struct IsGraph {
            // unsigned char prevents sign extension
            bool operator()(unsigned char ch) const {
                return std::isgraph(ch); // !std::isspace
            }
        };
    
        using FiltIt =
            boost::iterators::filter_iterator<IsGraph, std::string::const_iterator>;
    
        using base64_dec =                              //
            bai::transform_width<                       //
                bai::binary_from_base64<FiltIt>, 8, 6>; //
    
        //// CAUTION: this invokes UB:
        //std::string filtered(base64_dec(s.begin()), base64_dec(s.end()));
        FiltIt f(IsGraph{}, s.begin(), s.end()),   // !!
            l(f.predicate(), f.end(), f.end()); // !!
    
        std::string filtered(base64_dec{f}, base64_dec{l});
        std::cout << "OUT:" << std::quoted(filtered) << std::endl;
    }
    

    Prints itself (minus the sample base64)

    OUT:"hello world
    "
    

    Summary/TL;DR

    Consider the lurking bugs and undocumented limitations, consider using a proper, simpler base64 implementation.

    Beast has one in its implementation details, so that's also unsupported, but chances are pretty high that it is at least less brittle.

    Alternatively the must be a library that has proper tests and documentation.