c++ boost utf c++23

Converting UTF-8 to UTF-32


I can't find any help online. As I have seen in many C++23 programs, what I want to do is:

for(char32_t c : utf8string | utf8to32())

so that I can work on each individual code point. Preferably I'd like to use boost::locale for that, since I already call

boost::locale::normalize(in.begin(),in.end(),boost::locale::norm_nfc)

in the constructor. Online I've seen mention of a boost/text/transcode_iterator.hpp, which does not exist on my system, and even that wouldn't provide the utf8to32 class used above. Any hints on where I could find one? How do I even go about writing such a class? Presumably I need to wrap another iterator around the existing iterator of std::string, and then put that new iterator into a container-like type so that the for loop treats it as a range? Are there any examples of how to do that, assuming I have already found a fitting iterator implementation?

Just to be clear: I want to read UTF-8 code points, compare them to some given code points, and maybe mark where they are for future use. That is easy in 7-bit ASCII; how do I do it in UTF-8? It's not as if there are plenty of grammar parsers around yet that handle the full 21-bit code point range...


Solution

  • phuclv gave the decisive answer in a comment; I just needed to implement it. I use the installed ICU 74.2 for the low-level work and boost::spirit from Boost 1.84.0 for reading the UCD categories.

    The idea is for the parser to know certain predefined glyphs, which must then be found in user-supplied text files. Keep in mind that if the file attaches some "mark" to a glyph, the result is no longer the same glyph, so the parser should ignore that occurrence.

    However, instead of presenting that parser, I'll just show a basic example that decomposes a string into one glyph per line, written in a way that hopefully can also be done header-only and constexpr in C++23:

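    #include <compare>
    #include <cstddef>
    #include <cwchar>
    #include <iostream>
    #include <string>
    #include <string_view>
    #include <tuple>
    #include <utility>
    #include <vector>
    #include <boost/spirit/home/support/char_encoding/unicode/query.hpp>
    #include <unicode/utf8.h>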
    
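    // Encodes one code point as UTF-8. The _UNSAFE macros do no validation, so the input must be valid.
    constexpr std::string utf32to8(char32_t in){
        std::string out(U8_MAX_LENGTH+1,'\0');
        std::size_t size=0;
        U8_APPEND_UNSAFE(out.data(),size,in);
        out.resize(size);
        return out;
    }
    // Decodes the code point that starts at s; returns (code point, bytes consumed).
    constexpr auto utf8inc(const char8_t* s){
        char32_t out;
        std::size_t size=0;
        U8_NEXT_UNSAFE(s,size,out);
        return std::make_tuple(out,size);
    }
    // Decodes the code point that ends just before s; returns (code point, bytes consumed).
    // The macro decrements its index, so the index starts at U8_MAX_LENGTH relative to s-U8_MAX_LENGTH to keep the unsigned value from wrapping.
    constexpr auto utf8dec(const char8_t* s){
        char32_t out;
        std::size_t size=U8_MAX_LENGTH;
        U8_PREV_UNSAFE(s-U8_MAX_LENGTH,size,out);
        return std::make_tuple(out,U8_MAX_LENGTH-size);
    }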
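    // Forward iterator over code points: it points at the current code point, next is its length in bytes.
    // Decoding is eager, so stepping onto end() still reads the bytes at end(); that is only safe while the buffer is NUL-terminated, as a string literal is.
    template<class I>
    struct utf8_iterator{
        I it;
        char next;
        char32_t cur;
        constexpr utf8_iterator& operator++(){it+=next;auto [c,n]=utf8inc(&*it);cur=c;next=n;return *this;}
        constexpr utf8_iterator operator++(int){auto tmp=*this;operator++();return tmp;}
        constexpr char32_t operator*()const{return cur;}
        constexpr auto operator<=>(const utf8_iterator<I>&r)const{return it<=>r.it;}
        constexpr bool operator==(const utf8_iterator<I>&r)const{return it==r.it;}
    };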
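    // Backward counterpart: it points one past the current code point, prev is that code point's length in bytes.
    // Stepping onto rend() decodes the bytes in front of the buffer, which is out of bounds; a bounds-aware version would need to know where the buffer starts.
    template<class I>
    struct utf8_riterator{
        I it;
        char prev;
        char32_t cur;
        constexpr utf8_riterator& operator--(){it-=prev;auto [c,p]=utf8dec(&*it);cur=c;prev=p;return *this;}
        constexpr utf8_riterator operator--(int){auto tmp=*this;operator--();return tmp;}
        constexpr char32_t operator*()const{return cur;}
        constexpr auto operator<=>(const utf8_riterator<I>&r)const{return it<=>r.it;}
        constexpr bool operator==(const utf8_riterator<I>&r)const{return it==r.it;}
    };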
    // Range adaptor: `view | utf8to32_adapter()` yields one char32_t per code point.
    // Only a pointer to the string_view is stored, so the view must be an lvalue that outlives the adaptor.
    struct utf8to32_adapter{
        const std::u8string_view *sv=nullptr;
        utf8to32_adapter()=default;
        explicit utf8to32_adapter(const std::u8string_view& v) : sv(&v){}
        friend utf8to32_adapter&& operator|(const std::u8string_view& orig,utf8to32_adapter&& old){old.sv=&orig;return std::move(old);}
        using iterator=utf8_iterator<const char8_t*>;
        using reverse_iterator=utf8_riterator<const char8_t*>;
        // data()/size() avoid dereferencing the end iterator of the view.
        iterator begin()const{return ++iterator{sv->data(),0};}
        iterator end()const{return iterator{sv->data()+sv->size()};}
        reverse_iterator rbegin()const{return --reverse_iterator{sv->data()+sv->size(),0};}
        reverse_iterator rend()const{return reverse_iterator{sv->data()};}
    };
    struct utf8codepoint_adapter{
        const std::u8string_view *sv=nullptr;
        utf8codepoint_adapter()=default;
        explicit utf8codepoint_adapter(const std::u8string_view& v) : sv(&v){}
        friend utf8codepoint_adapter&& operator|(const std::u8string_view& orig,utf8codepoint_adapter&& old){old.sv=&orig;return std::move(old);}
        struct iterator{
            const char8_t* cur; // start of the current code point
            size_t next;        // length of the current code point in bytes
            // beware: U8_COUNT_TRAIL_BYTES_UNSAFE is only semi-stable and could become deprecated without warning!
            // It reads *cur even once cur has reached end(), so the buffer should be NUL-terminated.
            iterator& operator++(){cur+=next;next=1+U8_COUNT_TRAIL_BYTES_UNSAFE(*cur);return *this;}
            iterator operator++(int){iterator tmp=*this;operator++();return tmp;}
            auto operator<=>(const iterator& o)const{return cur<=>o.cur;}
            bool operator==(const iterator& o)const{return cur==o.cur;}
            auto& operator *()const{return cur;}
        };
        iterator begin()const{return ++iterator{sv->data(),0};}
        iterator end()const{return iterator{sv->data()+sv->size()};}
    };
    
    constexpr bool utf_nondeco(char32_t c){return (boost::spirit::ucd::get_major_category(c)!=boost::spirit::ucd::properties::mark);}
    
    using std::operator""sv;
    
    int main() {
        const auto in=u8"က︀ဂ︀င︀⋚︀"sv;
        std::vector<int> cutpos;
        int count=0;
        for(const char32_t& c : in|utf8to32_adapter()){
            if(utf_nondeco(c))cutpos.push_back(count);
            ++count;
        }
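        // Second pass: walk the start pointers of all code points and cut in front of every recorded position.
        count=0;
        std::vector<std::u8string_view> out;
        auto i=cutpos.begin();
        const char8_t* st=nullptr;
        for(const char8_t* c: in|utf8codepoint_adapter()){
            if(i!=cutpos.end()&&count==*i){if(st)out.push_back(std::u8string_view(st,c));st=c;++i;}
            ++count;
        }
        // st stays null when the input contains no non-mark code point at all.
        if(st)out.push_back(std::u8string_view(st,in.data()+in.size()));
        // std::cout cannot print char8_t, so the same bytes are viewed as plain char.
        for(const auto sv : out) std::cout<<std::string_view(reinterpret_cast<const char*>(sv.data()),sv.size())<<"\n";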
        return 0;
    }
    

    Anybody is free to copy my code and use it however they like. I hope it will be useful to someone, since examples of this kind are really lacking. The solution does not link against any external library: it is purely header-only C++23 code stuffed into a single file, and the classes above depend only on utf8.h from the ICU package and on std::string_view.

    I should also point out that the goal was not to actually split the text into glyphs; as mentioned, there are specific functions in various libraries for that. In fact, emoticons and the like will break here, which is an acceptable price for the simplicity of the parsing. Instead, the goal is to be future-proof: newer Unicode versions should be picked up automatically, without effort on my part and without the need to link any libraries. Sadly, my use of undocumented boost::spirit tables in utf_nondeco() could become an issue there in the future, but that's beyond the scope of this question...