Search code examples
sqlpostgresqlunicodegrapheme

How does the SQL length function handle unicode graphemes?


Consider the following scenario where I have the string defined by \U00000045\U00000301.

1) https://www.fileformat.info/info/unicode/char/0045/index.htm
2) https://www.fileformat.info/info/unicode/char/0301/index.htm

Would a table constrained by varchar(1) treat it as a valid 1 character input. Or would it be rejected because it is considered a 2 character input?

How does SQL treat the length of strings with graphemes in them generally?


Solution

  • I probably look silly with this query, but still:

    t=# with c(u) as (values( e'\U00000045\U00000301'))
    select u, u::varchar(1), u::varchar(2),char_length(u), octet_length(u) from c;
     u | u | u | char_length | octet_length
    ---+---+---+-------------+--------------
     É | E | É |           2 |            3
    (1 row)
    

    edit

    t=# show server_encoding ;
     server_encoding
    -----------------
     UTF8
    (1 row)
    
    t=# \l+ t
                                            List of databases
     Name | Owner | Encoding | Collate | Ctype | Access privileges | Size  | Tablespace | Description
    ------+-------+----------+---------+-------+-------------------+-------+------------+-------------
     t    | vao   | UTF8     | C       | UTF-8 |                   | 51 MB | pg_default |
    (1 row)