I'm trying to write a function that will run the fribidi algorithm on a std::string
and return a reordered std::string
. I hope it to be safe enough for any std::string
, and in case something fails in the way, it can return the original std::string
.
I saw many examples online that use std::wstring
, but I wonder whether I can avoid this conversion. Here's my attempt (I may have forgotten some includes).
# fribidi-test.cpp
#include <cstring>
#include <iostream>
#include <string>
#include <stdio.h>
#define FRIBIDI_NO_DEPRECATED
#include <fribidi/fribidi.h>
std::string fribidi_str_convert(std::string string_orig) {
std::cerr << "dbg: orig: " + string_orig + "\n";
FriBidiChar fribidi_in_char;
FriBidiStrIndex fribidi_len = fribidi_charset_to_unicode(
FRIBIDI_CHAR_SET_UTF8,
string_orig.c_str(),
string_orig.size(),
&fribidi_in_char
);
fprintf(stderr, "len is %i\n", fribidi_len);
// https://github.com/fribidi/fribidi#api
// Let fribidi think about the main direction by it's own (https://stackoverflow.com/q/58166995/4935114)
FriBidiCharType fribidi_pbase_dir = FRIBIDI_TYPE_LTR;
// Prepare output variable
FriBidiChar fribidi_visual_char;
fribidi_boolean stat = fribidi_log2vis(
/* input */
&fribidi_in_char,
fribidi_len,
&fribidi_pbase_dir,
/* output */
&fribidi_visual_char,
NULL,
NULL,
NULL
);
fprintf(stderr, "stat is: %d\n", stat);
if (stat) {
char string_formatted_ptr;
// Convert from fribidi unicode back to ptr
FriBidiStrIndex new_len = fribidi_unicode_to_charset(
FRIBIDI_CHAR_SET_UTF8,
&fribidi_visual_char,
fribidi_len,
&string_formatted_ptr
);
fprintf(stderr, "new_len is: %d\n", new_len);
if (new_len) {
fprintf(stderr, "string_formatted_ptr is: %s\n", &string_formatted_ptr);
std::string string_formatted_out(&string_formatted_ptr, new_len);
return string_formatted_out;
};
};
return string_orig;
};
int main() {
std::string orig = "אריק איינשטיין";
std::cerr << "main: orig: " + orig + "\n";
std::cerr << "main: transformed: " + fribidi_str_convert(orig) + "\n";
};
I compile and run it with:
g++ $(pkg-config --libs fribidi) fribidi-test.cpp -o fribidi-test && ./fribidi-test
My problem is that I'm getting a malformed output:
main: orig: ןייטשנייא קירא
dbg: orig: ןייטשנייא קירא
len is 14
stat is: 2
new_len is: 27
string_formatted_ptr is: אĐןייטשנייא קי
main: transformed: אĐןייטשנייא ק
That Đ
character is not supposed to be there. What I want to get is:
main: orig: ןייטשנייא קירא
dbg: orig: ןייטשנייא קירא
len is 14
stat is: 2
new_len is: 27
string_formatted_ptr is: אריק איינשטיין
main: transformed: אריק איינשטיין
Is this related to UTF16 encoding? and the fact that the new length is 27 - almost twice as that of the original length?
This is very wrong. You can't expect to store a string into one character.
Char is a char. It is not a pointer. Not a string. Remember to compile your programs with -fsanitize=undefined
and also check with valgrind.
FriBidiChar fribidi_in_char;
FriBidiStrIndex fribidi_len = fribidi_charset_to_unicode(....
&fribidi_in_char
);
char string_formatted_ptr;
FriBidiStrIndex new_len = fribidi_unicode_to_charset(...
&string_formatted_ptr
);
Also, };
- just use }
. There is no (need for) ;
after }
(in these cases).
It's cstdio
in C++.
Prefer to << string << string
instead of << string + string
to (I think) reduce memory allocations.
Fribidi API is bad, because I do not see how to calculate memory needed for the charset_to_unicode
. Even the fribidy
program - https://github.com/fribidi/fribidi/blob/cffa3047a0db9f4cd391d68bf98ce7b7425be245/bin/fribidi-main.c#L64 - just uses a constant amount of super big value. Also, fribidi program is the example that does not use std::wstring, because it is in C.
The following program uses a constant big buffer size like fribidi program:
#include <cassert>
#include <cstdio>
#include <cstring>
#include <iostream>
#include <string>
#include <string_view>
#include <vector>
#include <iomanip>
#define FRIBIDI_NO_DEPRECATED
#include <fribidi/fribidi.h>
#define MAX_STR_LEN 65000
std::string fribidi_str_convert(const std::string& string_orig) {
std::cerr << "dbg: orig: " + string_orig + "\n";
std::vector<FriBidiChar> fribidi_in_char(MAX_STR_LEN);
const FriBidiStrIndex fribidi_len = fribidi_charset_to_unicode(FRIBIDI_CHAR_SET_UTF8, string_orig.c_str(),
string_orig.size(), fribidi_in_char.data());
assert(fribidi_len < MAX_STR_LEN);
fribidi_in_char.resize(fribidi_len);
fprintf(stderr, "len is %i\n", fribidi_len);
//
FriBidiCharType fribidi_pbase_dir = FRIBIDI_TYPE_LTR;
std::vector<FriBidiChar> fribidi_visual_char(fribidi_len + 1);
const fribidi_boolean stat = fribidi_log2vis(fribidi_in_char.data(), fribidi_len, &fribidi_pbase_dir,
fribidi_visual_char.data(), NULL, NULL, NULL);
fprintf(stderr, "stat is: %d\n", stat);
//
if (stat) {
//
std::string string_formatted_ptr(MAX_STR_LEN, 0);
const FriBidiStrIndex new_len = fribidi_unicode_to_charset(FRIBIDI_CHAR_SET_UTF8, fribidi_visual_char.data(),
fribidi_len, string_formatted_ptr.data());
assert(new_len < MAX_STR_LEN);
string_formatted_ptr.resize(new_len);
fprintf(stderr, "new_len is: %d\n", new_len);
//
return string_formatted_ptr;
}
return string_orig;
}
int main() {
const std::string orig = "אריק איינשטיין";
std::cerr << "main: orig: " << orig << "\n";
const auto ret = fribidi_str_convert(orig);
std::cerr << "main: transformed: " << std::setw(10 + orig.size()) << ret << "\n";
}
and outputs:
$ g++ -lfribidi 1.cpp && ./a.out
main: orig: אריק איינשטיין
dbg: orig: אריק איינשטיין
len is 14
stat is: 2
new_len is: 27
main: transformed: ןייטשנייא קירא
Knowing that FriBidiChar
is uint32_t and fribidi internally uses UTF-32 and that wchar_t
on Linux is UTF-32, it would be preferable to use std::wstring
(or wchar_t
) to know how much memory to allocate. You could also count codepoints in UTF-8 input string and then precalculate the length of UTF-8 represetation of fribidi_visual_char
to allocate memory for fribidi_unicode_to_charset
.