Search code examples
postgresqlutf-8character-encoding

postgres: how to count multibyte emoji strings display length in UTF-8


Postgres (v11) counts the red heart ❤️ as two characters, and so on for other multibyte UTF-8 chars with selector units. Anyone know how I get postgres to count true characters and not the bytes?

For example, I would like both of the examples below should return 1.

select length('❤️') = 2 (Unicode: 2764 FE0F)

select length('🏃‍♂️') = 4 (Unicode: 1F3C3 200D 2642 FE0F)

UPDATE

Thank you to folks pointing out that postgres is correctly counting the Unicode code points and why and how this happens.

I don't see any other option other than pre-processing the emoji strings as bytes against a table of official Unicode character bytes, in Python or some such, to get the perceived length.


Solution

  • So one way to do this is to ignore all characters in the Variation Selector and decrement by 2 if you hit the General Punctuation range.

    This could be converted into a postgres function.

    python

    """
    # For reference, these code pages apply to emojis
    Name    Range
    Emoticons    1F600-1F64F
    Supplemental_Symbols_and_Pictographs     1F900-1F9FF
    Miscellaneous Symbols and Pictographs    1F300-1F5FF
    General Punctuation  2000-206F
    Miscellaneous Symbols    2600-26FF
    Variation Selectors  FE00-FE0F
    Dingbats     2700-27BF
    Transport and Map Symbols    1F680-1F6FF
    Enclosed Alphanumeric Supplement     1F100-1F1FF
    """
    emojis="🏃‍♂️🏃‍♂️🏃‍♂️🏃‍♂️🏃‍♂️🏃‍♂️🏃‍♂️" # true count is 7, postgres length() returns 28
    true_count=0
    for char in emojis:
        d=ord(char)
        char_type=None 
        if (d>=0x2000 and d<=0x206F) : char_type="GP" # Zero Width Joiner
        elif (d>=0xFE00 and d<=0xFE0F) : char_type="VS" # Variation Selector
        print(d, char_type)
        if ( char_type=="GP") : true_count-=2
        elif (char_type!="VS" ):  true_count+=1
    print(true_count)