c++, c, internationalization

Are there any replacements for strstr in C/C++ that can be used with multibyte strings?


I am trying to use strstr in C++ on Shift-JIS strings. The accepted answer here states that the standard strstr can produce false positives, so I can't simply use the one from the standard library. Windows provides _mbsstr, which does what I want, but I am targeting other platforms as well.

I tried gnulib, which also provides mbsstr, but I couldn't get it to work because it requires Autotools and I am using CMake.

Is there another way to achieve this?


Solution

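  • You are correct: unlike UTF-8, Shift-JIS can produce false positives with strstr and strchr, because the second byte of a two-byte character can have the same value as a single-byte character such as an ASCII letter or the backslash.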

    Here is a simplistic custom function for C:

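    #include <string.h>

    /* Find the first occurrence of s2 in s1, both Shift-JIS encoded,
       matching only at character boundaries. Returns NULL if not found. */
    char *sjis_strstr(const char *s1, const char *s2) {
        unsigned char c1, c2 = *s2++;
        if (c2 == '\0')
            return (char *)s1;  /* empty needle matches at the start */
        size_t len2 = strlen(s2);  /* length of the needle after its first byte */
        while ((c1 = *s1++) != '\0') {
            if (c1 == c2 && !strncmp(s1, s2, len2))
                return (char *)(s1 - 1);
            if (*s1 == '\0')
                break;  /* nothing left, or a truncated two-byte character */
            /* skip the trail byte when c1 is a lead byte (0x81-0x9F or 0xE0-0xFC) */
            s1 += (c1 >= 0x81 && c1 <= 0x9F) || (c1 >= 0xE0 && c1 <= 0xFC);
        }
        return NULL;
    }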
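
Since strchr has the same problem, a companion for single-byte searches might look like the sketch below. The names sjis_is_lead and sjis_strchr are hypothetical (they are not part of any standard library), and the code assumes the input is well-formed Shift-JIS and that ch is a single-byte character. It walks the string one character at a time, so trail bytes are never compared against ch.

```c
#include <stddef.h>
#include <string.h>

/* Nonzero if c is a Shift-JIS lead byte (first byte of a two-byte character). */
static int sjis_is_lead(unsigned char c) {
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Hypothetical helper: find the single-byte character ch in the Shift-JIS
   string s, comparing only at character boundaries. Like strchr, searching
   for '\0' returns a pointer to the terminator. */
char *sjis_strchr(const char *s, int ch) {
    unsigned char target = (unsigned char)ch;
    for (;;) {
        unsigned char c = (unsigned char)*s;
        if (sjis_is_lead(c)) {
            /* skip both bytes, but never step past the terminator
               if the two-byte character is truncated */
            s += (s[1] != '\0') ? 2 : 1;
            continue;
        }
        if (c == target)
            return (char *)s;
        if (c == '\0')
            return NULL;
        s++;
    }
}
```

For example, the katakana ソ is encoded as the bytes 0x83 0x5C, and 0x5C is also the ASCII backslash. Calling strchr on "\x83\x5C" with '\\' returns a pointer to the second byte, which is a false positive, while sjis_strchr returns NULL for the same input.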