c utf-8 mariadb

How do I ensure that a string containing UTF-8 characters is correctly written to MySQL in my C program?


I'm trying to insert a string into my MariaDB database that contains a Unicode character with a code point above 127, namely 177 (±).

wchar_t wcs = L"INSERT INTO text(drawing, eID, txt) VALUES(9,14063,'\261\065\061\067\071')";

using wcstombs:

int ret;
ret = wcstombs(querybuffer, wcs, 60);

ret is -1, which apparently means that there is a wide character that does not correspond to a valid multibyte character.

I've modified my code as per the suggestions from John Bollinger (much appreciated):

while(txt[i])
{
  c=cleaveMControl(txt,&i,j);
  if(c){
    if(c<128)
      query[j++]=c;
    else{
      query[j++]=92;
      query[j++]=92;
      sprintf(query+j,"u00%x",c);
      j=strlen(query);
    }
  }
}
query[j++]=39;
query[j++]=41;
query[j++]=59;
query[j]=0;
mysql_query(sqlconnect,query);

This produces:

MariaDB [D8]> select * from text where eID=14063;
+---------+-------+------------+
| drawing | eID   | txt        |
+---------+-------+------------+
|       9 | 14063 | \u00b15179 |
+---------+-------+------------+

I'm expecting to see '±5179'.

I'm still battling with this. Thanks to John Bollinger, I can now build the UTF-8 encoding of a character like this:

wchar_t c;
char uni[5];
uni[i++]=192+((c>>6)&15);
uni[i++]=128+(c&63);

This gives me the MySQL query string:

INSERT INTO text(drawing, eID, txt) VALUES(9,14063,'\302\261\065\061\067\071');

There is no error, but the value inserted into the database is missing the ± character:

+----------+-------+------------+
|        9 | 14063 | 15179      |
+----------+-------+------------+


Solution

  • How do I ensure that a string containing UTF-8 characters is correctly written to MySQL in my C program?

    To begin, this is wrong:

    wchar_t wcs=L"INSERT INTO text(drawing, eID, txt) VALUES(9,14063,'\261\065\061\067\071')"
    

    Your compiler should be warning you about assigning a pointer to an integer (and a narrower one at that) without a cast. It should also be complaining here:

    ret=wcstombs(querybuffer,wcs,60);
    

    ... about converting the other direction. If you are not getting such warnings then either turn up your compiler's warning level or get a better compiler.

    This definition of wcs would be more likely to work:

    const wchar_t *wcs = L"INSERT INTO text(drawing, eID, txt) VALUES(9,14063,'\261\065\061\067\071')";
    

    The main thing is that you need to declare wcs as a pointer, not an individual wchar_t. Adding const offers some protection against accidentally trying to modify the contents, which you must not do.
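With wcs declared as a pointer, the conversion call itself can be checked more carefully. The following is only a minimal sketch: wide_to_multibyte is a hypothetical helper name, not something from the question, and it assumes the caller has already selected a suitable locale. It treats both an unrepresentable character and a buffer that is too small as failures, because wcstombs does not write a terminating null byte when it fills the buffer completely.

```c
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: converts a wide string to the current locale's
   multibyte encoding. Returns 0 on success, -1 if a character cannot be
   represented or if the result plus its terminator does not fit in buf. */
int wide_to_multibyte(const wchar_t *ws, char *buf, size_t bufsize)
{
    size_t n = wcstombs(buf, ws, bufsize);

    if (n == (size_t)-1)
        return -1;   /* some wide character has no multibyte form here */
    if (n >= bufsize)
        return -1;   /* buf was filled without room for the '\0' */
    return 0;
}
```

In the default "C" locale only plain ASCII is guaranteed to convert, which would explain the -1 from the question. Calling setlocale(LC_ALL, "") from <locale.h> before the conversion should let ± come out as two bytes, assuming the environment actually selects a UTF-8 locale.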


    But if you can assume at least C11, then

    • you can portably define a string that is encoded in-memory via UTF-8 by using a UTF-8 string literal.

    • in any string literal, you can use "universal character names" to refer to characters by their Unicode code point values.

    Putting that together, in C11 or later you can just say:

    const char *s = u8"INSERT INTO text(drawing, eID, txt) VALUES(9,14063,'\u00b1" u8"5179')";
    

    to get your wanted UTF-8 encoded bytes, regardless of any consideration of source and execution character sets.

    By the way, it's not necessary to split that into two literals as I did, but doing so helps to clarify that the universal character name \u00b1 does not continue on into the subsequent digits of the string. (Note: there are 8-digit universal character names, too, but they start with \U instead of \u).
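To check what such a literal actually contains, you can look at the bytes that follow the opening quote of the value. The sketch below wraps the literal in a function; pm_query is a hypothetical name chosen for illustration. The cast is there only because, as far as I know, C23 gives u8 literals the element type char8_t (an unsigned char) instead of char; under C11 and C17 it changes nothing.

```c
/* Hypothetical helper: returns the query text as UTF-8 encoded bytes. */
const char *pm_query(void)
{
    /* \u00b1 is U+00B1 (plus-minus sign); in a u8 literal it is stored
       as the two bytes 0xC2 0xB1. The digits go in a separate literal so
       that they visibly do not belong to the universal character name. */
    return (const char *)
        u8"INSERT INTO text(drawing, eID, txt) VALUES(9,14063,'\u00b1" u8"5179')";
}
```

Dumping the bytes after the first single quote of pm_query() should show 0xC2 0xB1 followed by the ASCII digits 5179, which is the UTF-8 form of the value the question wants stored.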


    If you cannot assume at least C11 then you would probably still be better off skipping the wide string literal, instead inserting the UTF-8 encoding directly for the single character you need:

    const char *s = "INSERT INTO text(drawing, eID, txt) VALUES(9,14063,'\302\261" "5179')";
    

    That does rely on an ASCII- (and UTF-8-)compatible encoding for the characters of the execution basic character set, but that's a relatively safe bet, and a weaker requirement than is needed for your wide-string variant to work.
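If the character is only known at run time, as in the loop from the question, the bytes have to be computed rather than written as a literal. Here is a minimal sketch of a general encoder; utf8_encode is a hypothetical helper, not part of any library. Note that the lead byte of a two-byte sequence carries five payload bits, so masking it with 15 as in the question works for ± but would lose a bit for code points from U+0400 upwards.

```c
#include <stddef.h>

/* Hypothetical helper: writes the UTF-8 encoding of code point cp into out
   (room for 4 bytes required) and returns the number of bytes written, or
   0 if cp is not a valid Unicode scalar value. No terminator is appended. */
size_t utf8_encode(unsigned long cp, char *out)
{
    if (cp < 0x80) {                        /* ASCII: one byte, unchanged */
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 110xxxxx 10xxxxxx */
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not encodable */
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 11110xxx followed by 3 bytes */
        out[0] = (char)(0xF0 | (cp >> 18));
        out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* beyond the Unicode range */
}
```

In the question's loop, the else branch could then become something like j += utf8_encode(c, query + j); in place of the backslash and sprintf lines, assuming c holds a Unicode code point and query has enough room left.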