c++ string encoding bidi

How to use fribidi with std::string?


I'm trying to write a function that runs the fribidi algorithm on a std::string and returns a reordered std::string. It should be safe for any std::string, and if something fails along the way, it should return the original std::string.

I saw many examples online that use std::wstring, but I wonder whether I can avoid this conversion. Here's my attempt (I may have forgotten some includes).

# fribidi-test.cpp
#include <cstring>
#include <iostream>
#include <string>
#include <stdio.h>
#define FRIBIDI_NO_DEPRECATED
#include <fribidi/fribidi.h>

std::string fribidi_str_convert(std::string string_orig) {
    std::cerr << "dbg: orig: " + string_orig + "\n";
    FriBidiChar fribidi_in_char;
    FriBidiStrIndex fribidi_len = fribidi_charset_to_unicode(
        FRIBIDI_CHAR_SET_UTF8,
        string_orig.c_str(),
        string_orig.size(),
        &fribidi_in_char
    );
    fprintf(stderr, "len is %i\n", fribidi_len);
    // https://github.com/fribidi/fribidi#api
    // Let fribidi think about the main direction on its own (https://stackoverflow.com/q/58166995/4935114)
    FriBidiCharType fribidi_pbase_dir = FRIBIDI_TYPE_LTR;
    // Prepare output variable
    FriBidiChar     fribidi_visual_char;
    fribidi_boolean stat = fribidi_log2vis(
        /* input */
        &fribidi_in_char,
        fribidi_len,
        &fribidi_pbase_dir,
        /* output */
        &fribidi_visual_char,
        NULL,
        NULL,
        NULL
    );
    fprintf(stderr, "stat is: %d\n", stat);
    if (stat) {
        char string_formatted_ptr;
        // Convert from fribidi unicode back to ptr
        FriBidiStrIndex new_len = fribidi_unicode_to_charset(
            FRIBIDI_CHAR_SET_UTF8,
            &fribidi_visual_char,
            fribidi_len,
            &string_formatted_ptr
        );
        fprintf(stderr, "new_len is: %d\n", new_len);
        if (new_len) {
            fprintf(stderr, "string_formatted_ptr is: %s\n", &string_formatted_ptr);
            std::string string_formatted_out(&string_formatted_ptr, new_len);
            return string_formatted_out;
        };
    };
    return string_orig;
};

int main() {
    std::string orig = "אריק איינשטיין";
    std::cerr << "main: orig: " + orig + "\n";
    std::cerr << "main: transformed: " + fribidi_str_convert(orig) + "\n";
};

I compile and run it with:

g++ $(pkg-config --libs fribidi) fribidi-test.cpp -o fribidi-test && ./fribidi-test

My problem is that I'm getting a malformed output:

main: orig: ןייטשנייא קירא
dbg: orig: ןייטשנייא קירא
len is 14
stat is: 2
new_len is: 27
string_formatted_ptr is: אĐןייטשנייא קי
main: transformed: אĐןייטשנייא ק

That Đ character is not supposed to be there. What I want to get is:

main: orig: ןייטשנייא קירא
dbg: orig: ןייטשנייא קירא
len is 14
stat is: 2
new_len is: 27
string_formatted_ptr is: אריק איינשטיין
main: transformed: אריק איינשטיין

Is this related to UTF-16 encoding, and to the fact that the new length is 27, almost twice the original length?


Solution

  • This is very wrong. You can't store a whole string in a single character: a char is one char, not a pointer and not a string, so both calls below write past the end of a one-element object, which is undefined behavior. Compile your programs with -fsanitize=address,undefined and also check them with valgrind.

    FriBidiChar fribidi_in_char;
    FriBidiStrIndex fribidi_len = fribidi_charset_to_unicode(.... 
        &fribidi_in_char
    );
        char string_formatted_ptr;
        FriBidiStrIndex new_len = fribidi_unicode_to_charset(...
            &string_formatted_ptr
        );
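
To illustrate the point without fribidi, here is a minimal sketch of the usual pattern for C-style functions that write through an output pointer. write_code_points is a hypothetical stand-in (it is not part of fribidi) and the values it writes are arbitrary samples; the only point is that the caller must pass a buffer with room for every element the callee may write.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a C-style API that fills an output buffer:
// it writes `count` elements through `out` and returns how many it wrote.
std::size_t write_code_points(std::uint32_t* out, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        out[i] = 0x05D0 + static_cast<std::uint32_t>(i);  // arbitrary sample values
    }
    return count;
}

// The caller owns a buffer large enough for everything the callee may write,
// then shrinks it to the number of elements actually produced.
std::vector<std::uint32_t> collect_code_points(std::size_t count) {
    std::vector<std::uint32_t> buffer(count);
    const std::size_t written = write_code_points(buffer.data(), buffer.size());
    buffer.resize(written);
    return buffer;
}
```

Passing the address of a single std::uint32_t variable instead of buffer.data() would compile just as well, which is why the mistake in the question goes unnoticed until the output is corrupted.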
    

    Also, write } instead of };. No ; is needed after the closing } of a function body or an if block.

    In C++, include <cstdio> rather than <stdio.h>.

    Prefer << string << string over << string + string: chaining avoids building temporary strings and so (I think) reduces memory allocations.
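
As a small sketch of the difference (the function names are made up for illustration), both variants below produce the same text. The first builds temporary std::string objects with operator+ before anything is written; the second hands each piece to the stream directly.

```cpp
#include <sstream>
#include <string>

// Concatenation: label + text + "\n" creates temporary strings first.
std::string format_concatenated(const std::string& label, const std::string& text) {
    std::ostringstream out;
    out << label + text + "\n";
    return out.str();
}

// Chaining: each piece is written to the stream as-is, with no temporaries.
std::string format_chained(const std::string& label, const std::string& text) {
    std::ostringstream out;
    out << label << text << "\n";
    return out.str();
}
```

The same applies to std::cerr; std::ostringstream is used here only so the result can be inspected.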

    The fribidi API is awkward here, because I do not see a way to ask it how much memory fribidi_charset_to_unicode needs. Even the fribidi program - https://github.com/fribidi/fribidi/blob/cffa3047a0db9f4cd391d68bf98ce7b7425be245/bin/fribidi-main.c#L64 - just uses a very large constant buffer size. The fribidi program is also an example that does not use std::wstring, because it is written in C.
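
That said, worst-case sizes can be derived from UTF-8 itself. The sketch below is not a fribidi API; BufferSizes and worst_case_buffer_sizes are hypothetical names. It relies on two assumptions: decoding UTF-8 never yields more code points than there are input bytes, and the encoder writes at most 4 bytes per code point plus a terminating NUL.

```cpp
#include <cstddef>
#include <string>

// Worst-case element counts for a UTF-8 -> UTF-32 -> UTF-8 round trip.
struct BufferSizes {
    std::size_t code_points;  // elements needed for the UTF-32 buffer
    std::size_t utf8_bytes;   // bytes needed for the UTF-8 output, NUL included
};

BufferSizes worst_case_buffer_sizes(const std::string& utf8_input) {
    // Every code point occupies at least one input byte, so the byte count
    // is an upper bound on the number of code points.
    const std::size_t code_points = utf8_input.size();
    // A code point encodes to at most 4 UTF-8 bytes; one extra byte is
    // reserved on the assumption that the encoder appends a NUL terminator.
    const std::size_t utf8_bytes = 4 * code_points + 1;
    return {code_points, utf8_bytes};
}
```

For the 27-byte string in the question this reserves 27 UTF-32 elements and 109 output bytes, which is generous but bounded by the input rather than by a fixed constant.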

    The following program uses a large constant buffer size, like the fribidi program does:

    #include <cstdio>
    #include <iomanip>
    #include <iostream>
    #include <string>
    #include <vector>
    #define FRIBIDI_NO_DEPRECATED
    #include <fribidi/fribidi.h>
    
    #define MAX_STR_LEN 65000
    
    std::string fribidi_str_convert(const std::string& string_orig) {
            std::cerr << "dbg: orig: " << string_orig << "\n";
            // Reject inputs that could overflow the fixed-size buffers below. This assumes that
            // decoding yields at most one code point per byte and that re-encoding needs at most
            // 4 bytes per code point plus a terminating NUL.
            if (string_orig.size() >= MAX_STR_LEN / 4) {
                    return string_orig;
            }
            // UTF-8 -> UTF-32 (FriBidiChar)
            std::vector<FriBidiChar> fribidi_in_char(MAX_STR_LEN);
            const FriBidiStrIndex fribidi_len = fribidi_charset_to_unicode(FRIBIDI_CHAR_SET_UTF8, string_orig.c_str(),
                                                                           string_orig.size(), fribidi_in_char.data());
            fribidi_in_char.resize(fribidi_len);
            std::fprintf(stderr, "len is %i\n", fribidi_len);
            // Logical order -> visual order, with a left-to-right paragraph direction
            FriBidiParType fribidi_pbase_dir = FRIBIDI_PAR_LTR;
            std::vector<FriBidiChar> fribidi_visual_char(fribidi_len + 1);
            // fribidi_log2vis returns the maximum embedding level plus one, or 0 on failure
            const FriBidiLevel stat = fribidi_log2vis(fribidi_in_char.data(), fribidi_len, &fribidi_pbase_dir,
                                                      fribidi_visual_char.data(), NULL, NULL, NULL);
            std::fprintf(stderr, "stat is: %d\n", stat);
            if (stat) {
                    // UTF-32 -> UTF-8
                    std::string string_formatted_ptr(MAX_STR_LEN, 0);
                    const FriBidiStrIndex new_len = fribidi_unicode_to_charset(FRIBIDI_CHAR_SET_UTF8, fribidi_visual_char.data(),
                                                                               fribidi_len, string_formatted_ptr.data());
                    string_formatted_ptr.resize(new_len);
                    std::fprintf(stderr, "new_len is: %d\n", new_len);
                    return string_formatted_ptr;
            }
            // Conversion failed: fall back to the original string
            return string_orig;
    }
    
    int main() {
            const std::string orig = "אריק איינשטיין";
            std::cerr << "main: orig: " << orig << "\n";
            const auto ret = fribidi_str_convert(orig);
            std::cerr << "main: transformed: " << std::setw(10 + orig.size()) << ret << "\n";
    }
    

    and outputs:

    $ g++ -lfribidi 1.cpp && ./a.out 
    main: orig: אריק איינשטיין
    dbg: orig: אריק איינשטיין
    len is 14
    stat is: 2
    new_len is: 27
    main: transformed:           ןייטשנייא קירא
    

    Knowing that FriBidiChar is uint32_t, that fribidi internally uses UTF-32, and that wchar_t on Linux is UTF-32, it would be preferable to use std::wstring (or wchar_t) so that you know how much memory to allocate. You could also count the code points in the UTF-8 input string, and then precalculate the length of the UTF-8 representation of fribidi_visual_char to allocate memory for fribidi_unicode_to_charset. The new length of 27 has nothing to do with UTF-16: it is the UTF-8 byte count of the 14 code points (13 two-byte Hebrew letters plus one space).
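
The two calculations mentioned above can be sketched with the standard library alone. The function names are hypothetical, the input is assumed to be well-formed UTF-8, and std::u32string stands in for a buffer of FriBidiChar values.

```cpp
#include <cstddef>
#include <string>

// Counts code points in well-formed UTF-8 by counting every byte that is not
// a continuation byte (continuation bytes have the bit pattern 10xxxxxx).
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (const unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Number of bytes needed to encode the given UTF-32 code points as UTF-8,
// not counting a terminating NUL.
std::size_t utf8_encoded_length(const std::u32string& code_points) {
    std::size_t bytes = 0;
    for (const char32_t cp : code_points) {
        if (cp <= 0x7F) {
            bytes += 1;
        } else if (cp <= 0x7FF) {
            bytes += 2;
        } else if (cp <= 0xFFFF) {
            bytes += 3;
        } else {
            bytes += 4;
        }
    }
    return bytes;
}
```

For the string in the question, count_code_points gives 14 and utf8_encoded_length of the reordered text gives 27, matching the "len is 14" and "new_len is: 27" lines in the output.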