Tags: c, string, utf-8, multibyte-functions

Is it safe to use `strstr` to search for multibyte UTF-8 characters in a string?


Following up on my previous question, "Why `strchr` seems to work with multibyte characters, despite man page disclaimer?", I figured out that strchr was a bad choice.

Instead, I am thinking about using strstr to look for a single character (a multi-byte character, not a char):

const char str[] = "This string contains é which is a multi-byte character";
const char *pos = strstr(str, "é"); // "é" is 0xC3 0xA9 in UTF-8: 2 bytes
if (pos != NULL)                    // strstr returns NULL when nothing matches
    printf("%s\n", pos);

Output:

é which is a multi-byte character

This is what I expect: the position of the first byte of my multi-byte character.

A priori, this is not the canonical use of strstr, but it seems to work well.
Is this workaround safe? Can you think of any side effect or special case that would cause a bug?

[EDIT]: I should clarify that I do not want to use the wchar_t type and that the strings I handle are UTF-8 encoded (I am aware this choice can be discussed, but that debate is irrelevant here).


Solution

  • With UTF-8

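    UTF-8 is designed so that the byte sequence of one character can never occur inside, or straddle, the encodings of other characters: lead bytes and continuation bytes use disjoint value ranges. The partial match illustrated in the diagram below therefore cannot happen, and strstr cannot return a false positive. As long as both strings are valid UTF-8, it is safe to use strstr to search for multi-byte characters.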

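  • With other encodings

    strstr is not suitable for strings containing multi-byte characters in encodings where a trailing byte of one character can equal a single-byte character or the leading byte of another.

    Even if the string you are searching for contains no multi-byte characters, searching inside a string that does contain them may give a false positive. For example, with Shift-JIS encoding in a Japanese locale, strstr("掘something", "@some") may report a match, because the second byte of 掘 (0x8C 0x40) has the same value as '@' (0x40).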

    +---------+----+----+----+
    |   c1    | c2 | c3 | c4 |  <--- string
    +---------+----+----+----+
    
         +----+----+----+
         | c5 | c2 | c3 |  <--- string to search
         +----+----+----+
    

    If the trailing part of c1 happens to match c5, you get an incorrect result. For such encodings I would suggest either converting to Unicode and using a Unicode substring function, or using a multi-byte-aware substring function (for example _mbsstr from the Microsoft C runtime).