Hello wise men & women,
How would you select the first x bytes of a string?
The use case: I'm optimizing product description texts for upload to Amazon, and Amazon measures field lengths by bytes in utf8 (not latin1 as I stated earlier), not by characters. MySQL on the other hand, seems to operate character-based. (e.g., the function left() is character-based, not byte-based). The difference (using English, French, Spanish & German) is roughly 10%, but it can vary widely.
Some tests concerning a field with #bytes < 250 (details: http://wiki.devliegendebrigade.nl/Format_inventarisbestanden_(Amazon)#Veldlengte):
OK, char_length: 248, byte length latin1: 248, byte length utf8: 248
OK, char_length: 249, byte length latin1: 249, byte length utf8: 249
OK, char_length: 249, byte length latin1: 249, byte length utf8: 249
OK, char_length: 249, byte length latin1: 249, byte length utf8: 249
Not OK, char_length: 250, byte length latin1: 250, byte length utf8: 250
Not OK, char_length: 249, byte length latin1: 249, byte length utf8: 252
Not OK, char_length: 248, byte length latin1: 248, byte length utf8: 252
Not OK, char_length: 249, byte length latin1: 249, byte length utf8: 252
Not OK, char_length: 249, byte length latin1: 249, byte length utf8: 257
Illustration:
set @tekst="Jantje zag € pruimen hangen";
select
char_length(@tekst), # 27 characters
length(@tekst); # 29 bytes
select left(@tekst, 15) # Result: "Jantje zag € pr"
# Ideally, I'm looking for something like this:
select left_bytes_utf8(@tekst, 15) # Result: "Jantje zag € "
One approach would maybe be through a sproc that iteratively calls itself, but I suspect there are more efficient solutions around.
Thanks already & regards, Jeroen
P.s.: Edited the question: Changed 2x "latin1" to "utf8". It's actually a bit more confusing: Uploads should be in Latin1, but field sizes are measured in bytes using utf8
P.p.s: Update: These uploads are for English, French, Spanish & German Amazon-sites. Characters don't get more exotic than 'ø' (diameter), '€', 'è', 'é', 'ü' and 'ö'. All within Latin1-encoding, but multibyte in utf8.
SELECT CONVERT(LEFT(CONVERT(@tekst USING binary), 15) USING utf8);
will give you the UTF-8 string cut down to 15 bytes as long as it is still a valid UTF-8 string (MySQL will refuse to give you an invalid string, for example if you cut on a multibyte character, and give you NULL
instead.) If that doesn't work, you can get the raw bytes by omitting that last re-conversion to UTF-8, but you will have to decode them to something useful yourself:
SELECT LEFT(CONVERT(@tekst USING binary), 15);
However, Rick James gives a lot of good advice; though only you can judge the degree it is relevant to you, and your particular situation.