c++ macos encoding popen utf-16

Call popen() on a command with Chinese characters on Mac


I'm trying to execute a program on a file using the popen() function on a Mac. For this, I build a command of the form <path-to-executable> <path-to-file> and then call popen() on it. Right now, both components are stored as char*. I need to read the output of the command, so I need the pipe returned by popen().

Now it turns out that path-to-file can contain Chinese, Japanese, Russian and pretty much any other characters. For this, I can represent the path-to-file as wchar_t*. But this doesn't work with popen() because apparently Mac / Linux don't have a wide _wpopen() like Windows.

Is there any other way I can make this work? I'm getting the path-to-file from a data structure that can only give me wchar_t* so I have to take it from there and convert it appropriately, if needed.

Thanks in advance.

Edit:

Seems like one of those days when you just end up pulling your hair out.

So I tried using wcstombs, but the setlocale call failed for "C.UTF-8" and any of its permutations. Unsurprisingly, the wcstombs call failed returning -1 after that.

Then I tried to write my own iconv-based conversion from some sample code I found on Google. I came up with this, which stubbornly refuses to work:

iconv_t cd = iconv_open("UTF-8", "WCHAR_T");
// error checking here

wchar_t* inbuf = ...; // get wchar_t* here
char outbuf[<size-of-inbuf>*4+1];

size_t inlen  = <size-of-inbuf>;
size_t outlen = <size-of-inbuf>*4+1;

char* c_inbuf  = (char*) inbuf;
char* c_outbuf = outbuf;

int ret = iconv(cd, &c_inbuf, &inlen, &c_outbuf, &outlen);
// more error checking here

iconv always returns -1 and errno is set to EINVAL. I've verified that <size-of-inbuf> is set correctly. I've got no clue why this code is failing now.

Edit 2:

iconv was failing because I was not setting the input buffer length right. Also, Mac doesn't seem to support the "WCHAR_T" encoding so I've changed it to UTF-16. Now I've corrected the length and changed the from encoding but iconv just returns without converting any character. It just returns 0.

To debug this issue, I even changed the input string to a temp string and set the input length appropriately. Even this iconv call just returns 0. My code now looks like:

iconv_t cd = iconv_open("UTF-8", "UTF-16");
// error checking here

wchar_t* inbuf = ...; // get wchar_t* here - guaranteed to be UTF-16
char outbuf[<size-of-inbuf>*4+1];

size_t inlen  = <size-of-inbuf>;
size_t outlen = <size-of-inbuf>*4+1;

char* c_inbuf  = "abc"; // (char*) inbuf;
inlen = 4;
char* c_outbuf = outbuf;

int ret = iconv(cd, &c_inbuf, &inlen, &c_outbuf, &outlen);
// more error checking here

I've confirmed that the converter descriptor is being opened correctly. The from-encoding is correct. The input buffer contains a few simple characters. Everything is hardcoded and still, iconv doesn't convert any characters and just returns 0 and outbuf remains empty.

Sanity loss alert!


Solution

  • You'll need a UTF-8 string for popen. For this, you can use iconv to convert between different encodings, including from the local wchar_t encoding to UTF-8. (Note that on my Mac OS install, wchar_t is actually 32 bits, not 16.)

    EDIT Here's an example that works on OS X Lion. I did not have problems using the wchar_t encoding (and it is documented in the iconv man page).

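    #include <sys/param.h>  /* MAXPATHLEN */
    #include <stdlib.h>     /* free */
    #include <string.h>     /* strdup */
    #include <iconv.h>
    #include <stdio.h>
    #include <errno.h>
    
    /* Converts a NUL-terminated wchar_t string to UTF-8.
       utf32_bytes is the input size in bytes, terminator included.
       Returns a malloc'd string the caller must free, or NULL on failure. */
    char* utf8path(const wchar_t* wchar, size_t utf32_bytes)
    {
        char result_buffer[MAXPATHLEN];
    
        iconv_t converter = iconv_open("UTF-8", "wchar_t");
        if (converter == (iconv_t)-1)
        {
            perror("iconv_open");
            return NULL;
        }
    
        char* result = result_buffer;
        char* input = (char*)wchar;
        size_t output_available_size = sizeof result_buffer;
        size_t input_available_size = utf32_bytes;
        size_t result_code = iconv(converter, &input, &input_available_size, &result, &output_available_size);
        if (result_code == (size_t)-1)
        {
            perror("iconv");
            iconv_close(converter);  /* don't leak the descriptor on failure */
            return NULL;
        }
        iconv_close(converter);
    
        /* The terminator was part of the input, so result_buffer is NUL-terminated. */
        return strdup(result_buffer);
    }
    
    int main()
    {
        wchar_t hello_world[] = L"/éè/path/to/hello/world.txt";
    
        /* sizeof on the array counts the terminating L'\0' as well */
        char* utf8 = utf8path(hello_world, sizeof hello_world);
        if (utf8 == NULL)
            return 1;
        printf("%s\n", utf8);
        free(utf8);
        return 0;
    }
    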
    

    The utf8path function accepts a wchar_t string with its byte length (including the terminating null character) and returns the equivalent UTF-8 string, which the caller must free. If you deal with pointers to wchar_t instead of an array of wchar_t, you'll want to use (wcslen(ptr) + 1) * sizeof(wchar_t) instead of sizeof.
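
Once the path is a UTF-8 char*, the remaining step from the question is building the command and reading the pipe. The following is only a sketch of one way to do that, not part of the original answer: shell_quote and run_on_file are hypothetical helper names, and the single-quoting is an assumption about how you want to protect spaces and quotes in the path from /bin/sh. popen() itself just passes the bytes through, so the UTF-8 sequence reaches the child process unchanged.

```c
#define _POSIX_C_SOURCE 200809L  /* exposes popen/pclose under strict -std= modes */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: wraps arg in single quotes for /bin/sh and rewrites
   each embedded ' as '\'' . Returns a malloc'd string, or NULL on failure. */
char *shell_quote(const char *arg)
{
    assert(arg != NULL);
    /* Worst case: every byte is a quote (4 bytes each), plus 2 quotes and NUL. */
    char *out = malloc(strlen(arg) * 4 + 3);
    if (out == NULL)
        return NULL;

    char *p = out;
    *p++ = '\'';
    for (const char *s = arg; *s != '\0'; ++s) {
        if (*s == '\'') {
            memcpy(p, "'\\''", 4);
            p += 4;
        } else {
            *p++ = *s;  /* UTF-8 bytes are copied through untouched */
        }
    }
    *p++ = '\'';
    *p = '\0';
    return out;
}

/* Hypothetical helper: runs "<exe> <utf8_path>" through popen(), stores up to
   out_size - 1 bytes of its output in out (NUL-terminated), and returns the
   value of pclose(), or -1 if the command could not be started. */
int run_on_file(const char *exe, const char *utf8_path, char *out, size_t out_size)
{
    assert(out != NULL && out_size > 0);
    int status = -1;
    out[0] = '\0';

    char *q_exe = shell_quote(exe);
    char *q_path = shell_quote(utf8_path);
    if (q_exe != NULL && q_path != NULL) {
        size_t cmd_len = strlen(q_exe) + strlen(q_path) + 2;  /* space + NUL */
        char *cmd = malloc(cmd_len);
        if (cmd != NULL) {
            snprintf(cmd, cmd_len, "%s %s", q_exe, q_path);
            FILE *pipe = popen(cmd, "r");
            if (pipe != NULL) {
                size_t n = fread(out, 1, out_size - 1, pipe);
                out[n] = '\0';
                /* Drain whatever did not fit so the child can exit normally. */
                char discard[256];
                while (fread(discard, 1, sizeof discard, pipe) > 0) {
                }
                status = pclose(pipe);
            }
            free(cmd);
        }
    }
    free(q_exe);
    free(q_path);
    return status;
}
```

With the answer's function, a call would look roughly like run_on_file("/path/to/tool", utf8, buf, sizeof buf), where utf8 is the string returned by utf8path and "/path/to/tool" stands in for your executable. If you would rather avoid the shell entirely, fork() plus execv() with a pipe() sidesteps quoting altogether, at the cost of more code.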