What is the safe way to create a v8::String from a wchar_t with non-ASCII characters?


I'm writing a Node.js frontend for a DAB development board, which will eventually run on a Raspberry Pi. I am a Java and web developer, and I'm struggling with C++ and converting between different types of strings.

The DAB board comes with a C++ SDK, with a number of handy functions. It allows me to get the number of available programs with GetTotalProgram(). For each program I can call GetProgramName to get the program's name:

GetProgramName(char mode, long dabIndex, char namemode, wchar_t * programName)

... where mode selects FM or DAB, and namemode selects the long or short name. The program's name is returned in programName.

To convert the wchar_t *programName into a v8::String, I'm using this snippet that I found, and I understand the basics of it:

  wchar_t buff[300];
  char cbuff[600];
  GetProgramName(0, i, 1, buff);
  wcstombs( cbuff, buff, wcslen(buff) );
  Local<String> str = String::NewFromUtf8(isolate, (const char *) cbuff, v8::String::kNormalString, wcslen(buff));

I iterate through the available programs and build up a v8::Array:

void GetPrograms(const FunctionCallbackInfo<Value>& args) {
  Isolate* isolate = Isolate::GetCurrent();
  HandleScope scope(isolate);

  wchar_t buff[300];
  char cbuff[600];
  int numberOfPrograms, i;

  numberOfPrograms = GetTotalProgram();
  Local<v8::Array> ARRAY = Array::New(isolate, numberOfPrograms);

  for (i = 0; i < numberOfPrograms; i++) {
    if (GetProgramName(0, i, 1, buff)) {
      wcstombs( cbuff, buff, wcslen(buff) );
      Local<String> str = String::NewFromUtf8(isolate, (const char *) cbuff, v8::String::kNormalString, wcslen(buff));
      Local<Object> obj = Object::New(isolate);
      obj->Set(String::NewFromUtf8(isolate, "name"), str);
      ARRAY->Set(i, obj);
    }
  }
  args.GetReturnValue().Set(ARRAY);
}

I call the C++ method from my Node app:

var programs = ext.getPrograms();
for (var i = 0; i < programs.length; i++) {
  console.log(programs[i].name);
}

This mostly works, but when a program's name contains a non-ASCII character, like Æ, Ø or Å, that element and the ones after it in ARRAY get garbled names.

Here's what the Node snippet actually outputs (console.log), compared to the expected output:

| ACTUAL    | EXPECTED   |
| --------- | ---------- |
| NRK SUPER | NRK SUPER  |
| NRK VUPER | NRK VÆR    |
| NRK P1 ER | NRK P1     |

It seems as though the non-ASCII character causes the next wcstombs to quit early, not copying the later characters.

Why does this happen? Is there a better way to create a v8::String from my wchar_t?

Note: I have now been able to narrow this problem down to the wcstombs function when running on the Raspberry Pi. The following code:

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cwchar>

char cbuff[600];
wchar_t buff[300] = L"ABCø123abc";

int main( int argc, const char* argv[] ) {
    wcstombs( cbuff, buff, wcslen(buff) );
    wprintf(L"wcslen of wchar_t array: %zu - strlen of char array: %zu\n", wcslen(buff), strlen(cbuff));
}

when run on a Mac, outputs
wcslen of wchar_t array: 10 - strlen of char array: 10
but when run on the Raspberry Pi, outputs
wcslen of wchar_t array: 10 - strlen of char array: 3
That is, it counts only the characters before the ø character.

This looks similar to this unanswered question.


Solution

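  • The problem was in the wcstombs( cbuff, buff, wcslen(buff) ) call, which stopped converting as soon as it encountered a non-ASCII character. The docs say: "The behavior of this function depends on the LC_CTYPE category of the selected C locale." Unless setlocale has been called, a program runs in the "C" locale, and on the Raspberry Pi that locale cannot convert characters such as Æ or ø, so wcstombs gives up at that point. Because cbuff is reused and never cleared, the bytes left over from the previous name show through, which is how "NRK VÆR" ended up as "NRK VUPER".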

    So setting the locale to a UTF-8 variant solved the problem:

    setlocale(LC_CTYPE, "C.UTF-8");
    

    Having done this, I can now create v8::Strings this way:

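    wchar_t buff[300] = L"Something non-ASCII ÆØÅ here";
    char cbuff[600];
    // The third argument is the capacity of cbuff in bytes, not the number of wide characters.
    size_t len = wcstombs( cbuff, buff, sizeof(cbuff) );
    if (len == (size_t) -1) len = 0;  // conversion failed; fall back to an empty string
    // Pass the UTF-8 byte count; it is larger than wcslen(buff) when non-ASCII characters are present.
    Local<String> str = String::NewFromUtf8(isolate, (const char *) cbuff, v8::String::kNormalString, len);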
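
If changing the process-wide locale is not desirable, another option is to do the UTF-8 encoding by hand, which does not depend on LC_CTYPE at all. The following is a sketch under one loud assumption: each wchar_t holds a single Unicode code point, as is the case with the 32-bit wchar_t on Linux (including the Raspberry Pi) and macOS. WideToUtf8 is a hypothetical helper, not part of V8 or the DAB SDK.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encodes a wide string as UTF-8 without consulting
// the C locale.
// Assumption: each wchar_t holds one Unicode code point (32-bit wchar_t).
std::string WideToUtf8(const wchar_t* src) {
    std::string out;
    for (; *src != L'\0'; ++src) {
        std::uint32_t cp = static_cast<std::uint32_t>(*src);
        // Replace surrogates and out-of-range values with U+FFFD.
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
            cp = 0xFFFD;
        }
        if (cp < 0x80) {                 // 1 byte: plain ASCII
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {         // 2 bytes: covers Æ, Ø, Å, ø
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {       // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                         // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The result can then be handed to the same String::NewFromUtf8 call as above, passing the string's c_str() as the data and its size() (cast to int) as the length.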