
Perl issue when encoding mysql data from UTF-8 to UCS-2 for SMPP


I am trying to fetch the UTF-8 accented characters "é" and "ê" from MySQL and convert them to UCS-2 for sending over SMPP. The data is stored with the utf8_general_ci collation, and I do the following when opening the DB connection:

$dbh->{'mysql_enable_utf8'}=1;
$dbh->do("set NAMES 'utf8'");

If I test the sending part by hard-coding a string value containing "é" and "ê" and using data_coding=8, it goes through perfectly. However, if I comment out the hard-coded assignment (the first line of the code below) and just use what comes from the DB, it fails. If I send the characters from the DB with data_coding=3, it also works fine, but then the "ê" does not appear, which is expected. Here is what I use:

$fred = 'éêcole';  # <-- If I comment out this line, the SMPP call fails
$fred = decode('utf-8', $fred);
$fred = encode('UCS-2', $fred);

$resp_pdu = $short_smpp->submit_sm(
        source_addr_ton => 0x00,
        source_addr_npi => 0x01,
        source_addr => $didnb,
        dest_addr_ton => 0x01,
        dest_addr_npi => 0x01,
        destination_addr => $number,
        data_coding => 0x08,
        short_message => $fred
) or do {
        Log("ERROR: submit_sm indicated error: " . $resp_pdu->explain_status());
        $success = 0;
};

The relevant values of the data_coding field are the following (from "Meaning of "data_coding" field in SMPP"):

00000000 (0) - usually GSM7
00000011 (3) for standard ISO-8859-1
00001000 (8) for the universal character set -- de facto UTF-16
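
As a rough illustration of how these values relate to Perl encodings, here is a minimal sketch. It assumes the text is already decoded (a string of Unicode code points); the `%encoding_for` table and the `encode_for_data_coding` helper are hypothetical names used only for this example and are not part of Net::SMPP.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical mapping from data_coding values to Encode encoding names.
# GSM7 (0) is left out because core Encode has no encoding for it.
my %encoding_for = (
    3 => 'ISO-8859-1',
    8 => 'UCS-2BE',
);

# Hypothetical helper: encode decoded text for the given data_coding.
sub encode_for_data_coding {
    my ($text, $data_coding) = @_;
    my $name = $encoding_for{$data_coding}
        or die sprintf "Unsupported data_coding 0x%02X\n", $data_coding;
    # FB_CROAK makes encode() die on characters the target encoding
    # cannot represent; LEAVE_SRC keeps encode() from modifying $text.
    return encode($name, $text, Encode::FB_CROAK() | Encode::LEAVE_SRC());
}

printf "%vX\n", encode_for_data_coding("\xE9\xEA", 8);   # 0.E9.0.EA
printf "%vX\n", encode_for_data_coding("\xE9\xEA", 3);   # E9.EA
```

Dying on unrepresentable characters is a design choice for this sketch: it makes a wrong data_coding visible instead of silently sending a substitute character.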

The SMPP provider's documentation also mentions that special characters should be handled via UCS-2: https://community.sinch.com/t5/SMS-365-enterprise-service/Handling-Special-Characters/ta-p/1137

How should I prepare the data that is coming out of the DB to make this SMPP call work?

I am using Perl v5.10.1

Thanks !


Solution

  • Setting $dbh->{'mysql_enable_utf8'} = 1; makes the driver decode the values returned from the database, so queries return decoded text (strings of Unicode code points). It makes no sense to decode such a string a second time. Skip the decode and go straight to the encode.

    use Encode qw( encode );

    my $s_ucp = "\xE9\xEA\x63\x6F\x6C\x65";  # éêcole
    # -or-
    use utf8; # Script is encoded using UTF-8.
    my $s_ucp = "éêcole";
    
    printf "%vX\n", $s_ucp;                  # E9.EA.63.6F.6C.65
    
    my $s_ucs2be = encode('UCS-2', $s_ucp);
    
    printf "%vX\n", $s_ucs2be;               # 0.E9.0.EA.0.63.0.6F.0.6C.0.65
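
To see why the hard-coded string needs the decode while the database value does not, the two cases can be compared side by side. This is only a sketch: it assumes the script is saved as UTF-8 without `use utf8`, so the literal is written out here as explicit bytes. `$from_source` and `$from_db` are illustrative names, and `$from_db` merely stands in for what the driver would return with mysql_enable_utf8 enabled.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# What the source literal 'éêcole' holds without "use utf8":
# the UTF-8 bytes of the text, not the characters themselves.
my $from_source = "\xC3\xA9\xC3\xAA\x63\x6F\x6C\x65";

# Stand-in for a value fetched with mysql_enable_utf8 enabled:
# already decoded, one element per Unicode code point.
my $from_db = "\xE9\xEA\x63\x6F\x6C\x65";

# Bytes must be decoded first; decoded text is encoded directly.
my $ucs2_source = encode('UCS-2BE', decode('UTF-8', $from_source));
my $ucs2_db     = encode('UCS-2BE', $from_db);

printf "%vX\n", $ucs2_source;   # 0.E9.0.EA.0.63.0.6F.0.6C.0.65
printf "%vX\n", $ucs2_db;       # 0.E9.0.EA.0.63.0.6F.0.6C.0.65

# Decoding the already-decoded value treats its characters as bytes.
# Those bytes are not valid UTF-8, so the accented letters are lost.
my $wrong = decode('UTF-8', $from_db);
printf "%vX\n", $wrong;
```

Both paths produce the same UCS-2 bytes once each input is handled according to what it actually contains. The final line shows that the extra decode no longer yields the original text.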