Tags: postgresql, utf-8, character-encoding

How exactly is the length of a Postgres VARCHAR determined (UTF-8)?


The PostgreSQL documentation is very specific that VARCHAR(n) refers to a type which can store strings up to n characters (not bytes) in length. What a "character" actually is depends on the selected encoding for the database (Postgres supports lots).

For the sake of simplicity here, let's assume we're working with a database with UTF-8 encoding. The question, then: If we specify a VARCHAR(n) type, we're allowed to store n "characters", as Postgres calls them, but what precisely constitutes a character, to Postgres? If we want to check, in another language, that some string will fit in a VARCHAR(n), what should we actually be checking?

Unpacking this a little: First, unsurprisingly, some codepoints require more than 1 byte to store in UTF-8. For example, 😀 requires 4 bytes, with the UTF-8 encoding 0xf09f9880. This should probably just be considered 1 character.
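This is easy to check directly in SQL; the following is a quick sketch (assuming a UTF-8 encoded database) comparing byte length with codepoint length:

```sql
-- Sketch, assuming a UTF-8 encoded database:
-- byte length vs. codepoint length of the 4-byte emoji.
SELECT octet_length('😀') AS bytes,       -- 4
       char_length('😀')  AS codepoints;  -- 1
```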

The complication comes when we look at single visible characters which are represented with more than one codepoint. For example 🇦🇶 (the Antarctic flag) is composed of two codepoints: U+1F1E6 🇦 and U+1F1F6 🇶. When these codepoints occur sequentially, they form a single visible character, the Antarctic flag. Does Postgres consider this as 1 character for a VARCHAR, or as 2?
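The decomposition into two codepoints can itself be verified from SQL; as a sketch (assuming a UTF-8 database, where ascii() returns the Unicode code point of the first character):

```sql
-- Sketch: inspect the flag's constituent code points in a UTF-8 database.
SELECT to_hex(ascii('🇦')) AS first_codepoint,   -- 1f1e6
       to_hex(ascii('🇶')) AS second_codepoint,  -- 1f1f6
       char_length('🇦🇶')  AS total_codepoints;  -- 2
```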

From some testing, it appears that a Postgres "character" in a UTF-8 database is a codepoint. Trying to insert 😀 into a VARCHAR(1) works, but trying to insert 🇦🇶 does not (see the sketch below). Is this how it works, or is it more complicated?
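A minimal sketch reproducing that test (the temp table name t is just illustrative):

```sql
-- Sketch reproducing the test described above; table name t is illustrative.
CREATE TEMP TABLE t (c varchar(1));

INSERT INTO t VALUES ('😀');   -- succeeds: 1 code point (4 bytes)
INSERT INTO t VALUES ('🇦🇶');  -- fails: 2 code points
-- ERROR:  value too long for type character varying(1)
```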


Solution

  • You can use the length() or, equivalently, the char_length() function to measure the length of a string. These functions count the number of code points in the string, and that is also what is limited by the type modifier of character varying. PostgreSQL counts the length of a combining character as the number of code points involved.
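For example, a quick illustration in a UTF-8 database:

```sql
-- Both functions count code points, so the flag measures as 2.
SELECT length('🇦🇶') AS len, char_length('🇦🇶') AS char_len;  -- both 2
```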