Postgres (v11) counts the red heart ❤️ as two characters, and so on for other multibyte UTF-8 chars with selector units. Anyone know how I get postgres to count true characters and not the bytes?
For example, I would like both of the examples below should return 1.
select length('❤️') = 2 (Unicode: 2764 FE0F)
select length('🏃♂️') = 4 (Unicode: 1F3C3 200D 2642 FE0F)
UPDATE
Thank you to folks pointing out that postgres is correctly counting the Unicode code points and why and how this happens.
I don't see any other option other than pre-processing the emoji strings as bytes against a table of official Unicode character bytes, in Python or some such, to get the perceived length.
So one way to do this is to ignore all characters in the Variation Selector and decrement by 2 if you hit the General Punctuation range.
This could be converted into a postgres function.
python
"""
# For reference, these code pages apply to emojis
Name Range
Emoticons 1F600-1F64F
Supplemental_Symbols_and_Pictographs 1F900-1F9FF
Miscellaneous Symbols and Pictographs 1F300-1F5FF
General Punctuation 2000-206F
Miscellaneous Symbols 2600-26FF
Variation Selectors FE00-FE0F
Dingbats 2700-27BF
Transport and Map Symbols 1F680-1F6FF
Enclosed Alphanumeric Supplement 1F100-1F1FF
"""
emojis="🏃♂️🏃♂️🏃♂️🏃♂️🏃♂️🏃♂️🏃♂️" # true count is 7, postgres length() returns 28
true_count=0
for char in emojis:
d=ord(char)
char_type=None
if (d>=0x2000 and d<=0x206F) : char_type="GP" # Zero Width Joiner
elif (d>=0xFE00 and d<=0xFE0F) : char_type="VS" # Variation Selector
print(d, char_type)
if ( char_type=="GP") : true_count-=2
elif (char_type!="VS" ): true_count+=1
print(true_count)