postgres: how to count multibyte emoji strings display length in UTF-8

Postgres (v11) counts the red heart ❤️ as two characters, and so on for other multibyte UTF-8 chars with selector units. Anyone know how I get postgres to count true characters and not the bytes?

For example, I would like both of the examples below should return 1.

select length('❤️') = 2 (Unicode: 2764 FE0F)

select length('🏃‍♂️') = 4 (Unicode: 1F3C3 200D 2642 FE0F)

UPDATE

Thank you to folks pointing out that postgres is correctly counting the Unicode code points and why and how this happens.

I don't see any other option other than pre-processing the emoji strings as bytes against a table of official Unicode character bytes, in Python or some such, to get the perceived length.

Solution

So one way to do this is to ignore all characters in the Variation Selector and decrement by 2 if you hit the General Punctuation range.

This could be converted into a postgres function.

python

"""
# For reference, these code pages apply to emojis
Name    Range
Emoticons    1F600-1F64F
Supplemental_Symbols_and_Pictographs     1F900-1F9FF
Miscellaneous Symbols and Pictographs    1F300-1F5FF
General Punctuation  2000-206F
Miscellaneous Symbols    2600-26FF
Variation Selectors  FE00-FE0F
Dingbats     2700-27BF
Transport and Map Symbols    1F680-1F6FF
Enclosed Alphanumeric Supplement     1F100-1F1FF
"""
emojis="🏃‍♂️🏃‍♂️🏃‍♂️🏃‍♂️🏃‍♂️🏃‍♂️🏃‍♂️" # true count is 7, postgres length() returns 28
true_count=0
for char in emojis:
    d=ord(char)
    char_type=None 
    if (d>=0x2000 and d<=0x206F) : char_type="GP" # Zero Width Joiner
    elif (d>=0xFE00 and d<=0xFE0F) : char_type="VS" # Variation Selector
    print(d, char_type)
    if ( char_type=="GP") : true_count-=2
    elif (char_type!="VS" ):  true_count+=1
print(true_count)