Search code examples
mysqlstored-procedurescharacter-encodingstring-lengthiso-8859-1

MySQL - select first 10 bytes of a string


Hello wise men & women,

How would you select the first x bytes of a string?

The use case: I'm optimizing product description texts for upload to Amazon, and Amazon measures field lengths by bytes in utf8 (not latin1 as I stated earlier), not by characters. MySQL on the other hand, seems to operate character-based. (e.g., the function left() is character-based, not byte-based). The difference (using English, French, Spanish & German) is roughly 10%, but it can vary widely.

Some tests concerning a field with #bytes < 250 (details: http://wiki.devliegendebrigade.nl/Format_inventarisbestanden_(Amazon)#Veldlengte):

OK, char_length: 248,   byte length latin1: 248,   byte length utf8: 248
OK, char_length: 249,   byte length latin1: 249,   byte length utf8: 249
OK, char_length: 249,   byte length latin1: 249,   byte length utf8: 249
OK, char_length: 249,   byte length latin1: 249,   byte length utf8: 249

Not OK, char_length: 250,   byte length latin1: 250,   byte length utf8: 250
Not OK, char_length: 249,   byte length latin1: 249,   byte length utf8: 252
Not OK, char_length: 248,   byte length latin1: 248,   byte length utf8: 252
Not OK, char_length: 249,   byte length latin1: 249,   byte length utf8: 252
Not OK, char_length: 249,   byte length latin1: 249,   byte length utf8: 257

Illustration:

set @tekst="Jantje zag € pruimen hangen";

select
   char_length(@tekst),   # 27 characters
   length(@tekst);        # 29 bytes

select left(@tekst, 15)   # Result: "Jantje zag € pr"

# Ideally, I'm looking for something like this:

select left_bytes_utf8(@tekst, 15)   # Result: "Jantje zag € "

One approach would maybe be through a sproc that iteratively calls itself, but I suspect there are more efficient solutions around.

Thanks already & regards, Jeroen

P.s.: Edited the question: Changed 2x "latin1" to "utf8". It's actually a bit more confusing: Uploads should be in Latin1, but field sizes are measured in bytes using utf8

P.p.s: Update: These uploads are for English, French, Spanish & German Amazon-sites. Characters don't get more exotic than 'ø' (diameter), '€', 'è', 'é', 'ü' and 'ö'. All within Latin1-encoding, but multibyte in utf8.


Solution

  • SELECT CONVERT(LEFT(CONVERT(@tekst USING binary), 15) USING utf8);
    

    will give you the UTF-8 string cut down to 15 bytes as long as it is still a valid UTF-8 string (MySQL will refuse to give you an invalid string, for example if you cut on a multibyte character, and give you NULL instead.) If that doesn't work, you can get the raw bytes by omitting that last re-conversion to UTF-8, but you will have to decode them to something useful yourself:

    SELECT LEFT(CONVERT(@tekst USING binary), 15);
    

    However, Rick James gives a lot of good advice; though only you can judge the degree it is relevant to you, and your particular situation.