Search code examples
mysqlstringperlutf8mb4

getting length of utf8mb4 string with perl from MySql


i wrote a small perl function that takes a string and checks its length without the spaces. the basic code looks as follows:

sub foo
{
   use utf8;
   my @wordsArray = split(/ /, $_[0]));
   my $result = length(join('', @wordsArray));
   return $result;
}

When i provide this function with a string containing special characters (such as hebrew letters), it seems to work great. the problem starts when i use a value coming from MySql column, with a character set of utf8mb4: in such case, the value being calculated is higher than the value in the previous example.

I can guess why such behavior occurs: the special characters are written in a 4 byte manner in the table, and thus each letter calculates as two characters in the utf8 encoding.

Does anyone know how the above can be resolved, so that i will get the right number of characters from a string coming from DB table defined as utf8mb4?

EDIT:

Some more information regarding the above code:

The DB column used as an argument for the function is of type VARCHAR(1000), with collation of utf8mb4_unicode_ci. I am fetching the row via a MySql connection configured as follows:

$mySql = DBI->connect(
  "DBI:mysql:$db_info{'database'}:$db_info{'hostname'};mysql_multi_statements=1;",
  "$db_info{'user'}",
  "$db_info{'password'}",
  {'RaiseError' => 1,'AutoCommit' => 0});
...
$mySql->do("set names utf8mb4");

an example data value would be "שלום עולם" (which in hebrew means "Hello World").

1) When calling foo($request->{VALUE}); (where VALUE is the column data from DB), the result is 16 (where each hebrew character is counted as two characters, and one space between them is disregarded). Dumper in this case is:

$VAR1 = "\327\251\327\234\327\225\327\235 \327\242\327\225\327\234\327\235";

2) When calling foo("שלום עולם");:

  • when declaring use utf8;, the result is 8 (as there are 8 visible characters in this string). Dumper (Useqq=1) in this case is:

    $VAR1 = "\x{5e9}\x{5dc}\x{5d5}\x{5dd} \x{5e2}\x{5d5}\x{5dc}\x{5dd}";

  • when not declaring `use utf8;', the result is 16, and is similar to the case of sending the value from DB:

    $VAR1 = "\327\251\327\234\327\225\327\235 \327\242\327\225\327\234\327\235";

Looks like i need to find a way of converting the received value to UTF8 before starting to work with it.


Solution

  • What MySQL calls utf8 is a limited subset of UTF-8 which allows only three bytes per character and covers code points up to 0xFFFF. Even utf8mb4 doesn't cover the full UTF-8 range, which supports encoded characters up to 6 bytes long

    The consequence is that any data from either a utf8 or a utf8mb4 column is simply a UTF-8 string in Perl and there should be no difference between the two database encodings

    It is my guess that you haven't enabled UTF-8 for your DBI handle, so everything is being treated as just a sequence of bytes. You should enable the mysql_enable_utf8 when you make the connect call, which should then look something like

    my $dbh = DBI->connect($dsn, $user, $password, { mysql_enable_utf8 => 1 });
    

    With the additional data, I can see that the string you are retrieving from the database is indeed שלום עולם UTF-8-encoded

    However, if I decode it, then first of all I get a non-space character count of 8 from both your foo subroutine and my own, not 9; and also you should be getting characters back from the database, not bytes

    I suspect that you may have written an encoded string to the database in the first place. Here's a short program that creates a MySQL table, writes two records to it (one character string and one encoded string) and retrieves what it has written. You will see that The only thing that makes a difference is the setting of mysql_enable_utf8. The behaviour is the same whether or not the original string is encoded, and with or without SET NAMES utf8mb4

    Further experimentation showed that either mysql_enable_utf8 or SET NAMES utf8mb4 will get DBI to write the data correctly, but the latter has no effect on reading

    I suggest that your solution should be to use ONLY mysql_enable_utf8 when either reading or writing

    You should also use utf8 only at the top of all your programs. Missing this out means you can't use any non-ASCII characters in your code

    use utf8;
    use strict;
    use warnings;
    
    use DBI;
    use open qw/ :std :encoding(utf-8) /;
    
    STDOUT->autoflush;
    
    my $VAR1 = "\327\251\327\234\327\225\327\235 \327\242\327\225\327\234\327\235";
    
    my $dbh = DBI->connect(
        qw/ DBI:mysql:database=temp admin admin /, {
            RaiseError => 1,
            PrintError => 0,
            mysql_enable_utf8 => 1,
        }
    ) or die DBI::errstr;
    
    $dbh->do('SET NAMES utf8mb4');
    
    $dbh->do('DROP TABLE IF EXISTS temp');
    $dbh->do('CREATE TABLE temp (value VARCHAR(64) CHARACTER SET utf8mb4)');
    
    my $insert = $dbh->prepare('INSERT INTO temp (value) VALUES (?)');
    $insert->execute('שלום עולם');
    $insert->execute($VAR1);
    
    my $values = $dbh->selectcol_arrayref('SELECT value FROM temp');
    printf "string: %s  foo: %d\n", $_, foo($_) for @$values;
    
    sub foo2 {
      $_[0] =~ tr/ //c;
    }
    
    sub foo {
      length join '', split / /, $_[0];
    }
    

    output with mysql_enable_utf8 => 1

    string: שלום עולם  foo: 8
    string: שלום עולם  foo: 8
    

    output with mysql_enable_utf8 => 0

    string: ש××× ×¢×××  foo: 16
    string: ש××× ×¢×××  foo: 16