How to use fribidi with std::string?

I'm trying to write a function that runs the fribidi algorithm on a std::string and returns a reordered std::string. I'd like it to be safe for any std::string: if something fails along the way, it should return the original std::string.

I saw many examples online that use std::wstring, but I wonder whether I can avoid this conversion. Here's my attempt (I may have forgotten some includes).

# fribidi-test.cpp
#include <cstring>
#include <iostream>
#include <string>
#include <stdio.h>
#define FRIBIDI_NO_DEPRECATED
#include <fribidi/fribidi.h>

std::string fribidi_str_convert(std::string string_orig) {
    std::cerr << "dbg: orig: " + string_orig + "\n";
    FriBidiChar fribidi_in_char;
    FriBidiStrIndex fribidi_len = fribidi_charset_to_unicode(
        FRIBIDI_CHAR_SET_UTF8,
        string_orig.c_str(),
        string_orig.size(),
        &fribidi_in_char
    );
    fprintf(stderr, "len is %i\n", fribidi_len);
    // https://github.com/fribidi/fribidi#api
    // Let fribidi determine the main direction on its own (https://stackoverflow.com/q/58166995/4935114)
    FriBidiCharType fribidi_pbase_dir = FRIBIDI_TYPE_LTR;
    // Prepare output variable
    FriBidiChar     fribidi_visual_char;
    fribidi_boolean stat = fribidi_log2vis(
        /* input */
        &fribidi_in_char,
        fribidi_len,
        &fribidi_pbase_dir,
        /* output */
        &fribidi_visual_char,
        NULL,
        NULL,
        NULL
    );
    fprintf(stderr, "stat is: %d\n", stat);
    if (stat) {
        char string_formatted_ptr;
        // Convert from fribidi unicode back to ptr
        FriBidiStrIndex new_len = fribidi_unicode_to_charset(
            FRIBIDI_CHAR_SET_UTF8,
            &fribidi_visual_char,
            fribidi_len,
            &string_formatted_ptr
        );
        fprintf(stderr, "new_len is: %d\n", new_len);
        if (new_len) {
            fprintf(stderr, "string_formatted_ptr is: %s\n", &string_formatted_ptr);
            std::string string_formatted_out(&string_formatted_ptr, new_len);
            return string_formatted_out;
        };
    };
    return string_orig;
};

int main() {
    std::string orig = "אריק איינשטיין";
    std::cerr << "main: orig: " + orig + "\n";
    std::cerr << "main: transformed: " + fribidi_str_convert(orig) + "\n";
};

I compile and run it with:

g++ $(pkg-config --libs fribidi) fribidi-test.cpp -o fribidi-test && ./fribidi-test

My problem is that I'm getting a malformed output:

main: orig: ןייטשנייא קירא
dbg: orig: ןייטשנייא קירא
len is 14
stat is: 2
new_len is: 27
string_formatted_ptr is: אĐןייטשנייא קי
main: transformed: אĐןייטשנייא ק

That Đ character is not supposed to be there. What I want to get is:

main: orig: ןייטשנייא קירא
dbg: orig: ןייטשנייא קירא
len is 14
stat is: 2
new_len is: 27
string_formatted_ptr is: אריק איינשטיין
main: transformed: אריק איינשטיין

Is this related to UTF-16 encoding, and to the fact that the new length is 27, almost twice the original length?

Solution

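This is very wrong. You pass the address of a single FriBidiChar, and later of a single char, where fribidi expects a pointer to a buffer large enough for the whole string. You can't store a string in one character: a char is a char, not a pointer and not a string. Writing past it is undefined behavior, which is where the garbage in your output comes from. Compile your programs with -fsanitize=undefined (adding -fsanitize=address helps catch out-of-bounds writes like this one) and also check them with valgrind. The offending lines are: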

FriBidiChar fribidi_in_char;
FriBidiStrIndex fribidi_len = fribidi_charset_to_unicode(....
    &fribidi_in_char
);
char string_formatted_ptr;
FriBidiStrIndex new_len = fribidi_unicode_to_charset(...
    &string_formatted_ptr
);

Also, instead of }; just write }. No ; is needed after the closing } of a function body or an if block.

In C++, include <cstdio> rather than <stdio.h>.

Prefer << string << string over << string + string: the + builds a temporary std::string first, so chaining << should (I think) reduce memory allocations.

The fribidi API is awkward here, because I do not see a documented way to calculate how much memory fribidi_charset_to_unicode needs. Even the fribidi program itself - https://github.com/fribidi/fribidi/blob/cffa3047a0db9f4cd391d68bf98ce7b7425be245/bin/fribidi-main.c#L64 - just uses a very large constant buffer size. The fribidi program is also an example that does not use std::wstring, because it is written in C.
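
A conservative bound can still be reasoned out from UTF-8 itself. The sketch below is not something the fribidi documentation promises. The helper names max_code_points and max_utf8_bytes are hypothetical, and the +1 assumes the UTF-8 converter appends a terminating NUL.

```cpp
#include <cstddef>
#include <string>

// Upper bound on the number of code points decoded from a UTF-8 string:
// every code point occupies at least one byte, so there can never be
// more code points than bytes.
std::size_t max_code_points(const std::string& utf8) {
    return utf8.size();
}

// Upper bound on the UTF-8 bytes needed for n_code_points code points:
// at most 4 bytes each, plus 1 byte for a terminating NUL
// (assumption: the converter writes one after the data).
std::size_t max_utf8_bytes(std::size_t n_code_points) {
    return 4 * n_code_points + 1;
}
```

These are worst-case sizes, so the buffers should be shrunk to the lengths the conversion functions actually return.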

The following program uses a large constant buffer size, like the fribidi program does:

#include <cassert>
#include <cstdio>
#include <cstring>
#include <iostream>
#include <string>
#include <string_view>
#include <vector>
#include <iomanip>
#define FRIBIDI_NO_DEPRECATED
#include <fribidi/fribidi.h>

#define MAX_STR_LEN 65000

std::string fribidi_str_convert(const std::string& string_orig) {
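        std::cerr << "dbg: orig: " << string_orig << "\n";
        // Check the size before converting: each UTF-8 byte decodes to at most
        // one code point, so this keeps the fixed-size buffer from overflowing.
        if (string_orig.size() >= MAX_STR_LEN) {
                return string_orig;
        }
        std::vector<FriBidiChar> fribidi_in_char(MAX_STR_LEN);
        const FriBidiStrIndex fribidi_len = fribidi_charset_to_unicode(FRIBIDI_CHAR_SET_UTF8, string_orig.c_str(),
                                                                       string_orig.size(), fribidi_in_char.data());
        assert(fribidi_len < MAX_STR_LEN);
        fribidi_in_char.resize(fribidi_len);
        fprintf(stderr, "len is %i\n", fribidi_len);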
        //
        FriBidiCharType fribidi_pbase_dir = FRIBIDI_TYPE_LTR;
        std::vector<FriBidiChar> fribidi_visual_char(fribidi_len + 1);
        const fribidi_boolean stat = fribidi_log2vis(fribidi_in_char.data(), fribidi_len, &fribidi_pbase_dir,
                                               fribidi_visual_char.data(), NULL, NULL, NULL);
        fprintf(stderr, "stat is: %d\n", stat);
        //
        if (stat) {
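                // Worst case for UTF-8 is 4 bytes per code point, plus a terminating
                // NUL (assumption: the converter appends one). Bail out before the
                // write if that could exceed the fixed-size buffer.
                if (4 * static_cast<std::size_t>(fribidi_len) + 1 > MAX_STR_LEN) {
                        return string_orig;
                }
                std::string string_formatted_ptr(MAX_STR_LEN, 0);
                const FriBidiStrIndex new_len = fribidi_unicode_to_charset(FRIBIDI_CHAR_SET_UTF8, fribidi_visual_char.data(),
                                                                           fribidi_len, string_formatted_ptr.data());
                assert(new_len < MAX_STR_LEN);
                string_formatted_ptr.resize(new_len);
                fprintf(stderr, "new_len is: %d\n", new_len);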
                //
                return string_formatted_ptr;
        }
        return string_orig;
}

int main() {
        const std::string orig = "אריק איינשטיין";
        std::cerr << "main: orig: " << orig << "\n";
        const auto ret = fribidi_str_convert(orig);
        std::cerr << "main: transformed: " << std::setw(10 + orig.size()) << ret << "\n";
}

and outputs:

$ g++ -lfribidi 1.cpp && ./a.out 
main: orig: אריק איינשטיין
dbg: orig: אריק איינשטיין
len is 14
stat is: 2
new_len is: 27
main: transformed:           ןייטשנייא קירא

FriBidiChar is a uint32_t, fribidi works internally in UTF-32, and wchar_t on Linux is also 32 bits wide, so using std::wstring (or wchar_t) would make it easier to know how much memory to allocate. Alternatively, you could count the code points in the UTF-8 input string to size the input buffer, and then precalculate the length of the UTF-8 representation of fribidi_visual_char to allocate memory for fribidi_unicode_to_charset.
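
The two counting steps could look roughly like this. It is a minimal sketch that assumes well-formed UTF-8 input. The helper names count_utf8_code_points and utf8_encoded_length are hypothetical and not part of fribidi, and std::uint32_t stands in for FriBidiChar so the snippet does not depend on the library.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Count code points in a UTF-8 string by counting every byte that is not a
// continuation byte (continuation bytes have the bit pattern 10xxxxxx).
// Assumes the input is well-formed UTF-8.
std::size_t count_utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// Number of bytes needed to encode the given UTF-32 code points as UTF-8,
// not counting a terminating NUL.
std::size_t utf8_encoded_length(const std::vector<std::uint32_t>& code_points) {
    std::size_t bytes = 0;
    for (std::uint32_t cp : code_points) {
        if (cp <= 0x7F) {
            bytes += 1;
        } else if (cp <= 0x7FF) {
            bytes += 2;
        } else if (cp <= 0xFFFF) {
            bytes += 3;
        } else {
            bytes += 4;
        }
    }
    return bytes;
}
```

For the string in the question this gives 14 code points and 27 bytes (13 two-byte Hebrew letters plus one space), matching the len and new_len values printed above. If the output buffer also has to hold a terminating NUL, allocate one byte more than utf8_encoded_length returns.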