Search code examples
jdbcutf-8coldfusionmariadbutf8mb4

UTF8mb4 unicode breaking MariaDB JDBC driver


I have some product names that include unicode characters

⚠️📷PLEASE READ! WORKING KODAK DC215 ZOOM 1.0MP DIGITAL CAMERA - UK SELLER

A query in heidiSQL shows it fine

enter image description here

I setup MariaDB new this morning having moved from MySQL, but when records are retrieved through a ColdFusion Query using the MariaDB JDBC I get

java.lang.StringIndexOutOfBoundsException: begin 0, end 80, length 74
    at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3410)
    at java.base/java.lang.String.substring(String.java:1883)
    at org.mariadb.jdbc.internal.com.read.resultset.rowprotocol.TextRowProtocol.getInternalString(TextRowProtocol.java:238)
    at org.mariadb.jdbc.internal.com.read.resultset.SelectResultSet.getString(SelectResultSet.java:948)

The productname field collation is utf8mb4_unicode_520_ci, I've tried a few options. I've tried to set this at table and database level where it let me.

The JDBC connection string in ColdFusion admin is jdbc:mysql://localhost:3307/usedlens?useUnicode=true&characterEncoding=UTF-8

I note that the live production database where MariaDB was used from the beginning I don't have this trouble but the default charset is latin1, and the same record is the database as

????PLEASE READ! WORKING KODAK DC215 ZOOM 1.0MP DIGITAL CAMERA - UK SELLER


Solution

  • Here's how we've been stripping high ASCII characters while retaining any characters that may be salvaged:

    string function ASCIINormalize(string inputString=""){
        return createObject( 'java', 'java.text.Normalizer' ).normalize( javacast("string", arguments.inputString) , createObject( 'java', 'java.text.Normalizer$Form' ).valueOf('NFD') ).replaceAll('\p{InCombiningDiacriticalMarks}+','').replaceAll('[^\p{ASCII}]+','');
    }
    
    productname = ASCIINormalize(productname);
    
    /*
    Comparisons using java UDF versus reReplace regex:
    
    "ABC   Café ’test"  (note: High ASCII non-normal whitespace characters used.)
       ASCIINormalize = "ABC Cafe test"
       reReplace = "ABC Caf test"
    
    "čeština"
       ASCIINormalize = "cestina"
       reReplace = "etina"
    
    "Häuser Bäume Höfe Gärten"
       ASCIINormalize = "Hauser Baume Hofe Garten"
       reReplace = "Huser Bume Hfe Grten"
    */