The PostgreSQL documentation is very specific that VARCHAR(n) refers to a type which can store strings up to n characters (not bytes) in length. What a "character" actually is depends on the selected encoding for the database (Postgres supports lots).

For the sake of simplicity here, let's assume we're working with a database with UTF-8 encoding. The question, then: if we specify a VARCHAR(n) type, we're allowed to store n "characters", as Postgres calls them, but what precisely constitutes a character to Postgres? If we want to check, in another language, that some string will fit in a VARCHAR(n), what should we actually be checking?
Unpacking this a little: first, unsurprisingly, some codepoints require more than 1 byte to store in UTF-8. For example, 😀 requires 4 bytes; its UTF-8 encoding is 0xf09f9880. This should probably just be considered 1 character.
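You can see the byte length and the "character" length disagree directly in Postgres; something like this quick check (assuming a UTF-8 encoded database) illustrates it:

```sql
-- octet_length() returns the byte length, char_length() the code point count.
SELECT octet_length('😀') AS bytes,  -- 4
       char_length('😀')  AS chars;  -- 1
```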
The complication comes when we look at single visible characters which are represented by more than one codepoint. For example 🇦🇶 (the Antarctic flag) consists of two codepoints: U+1F1E6 🇦 and U+1F1F6 🇶. When these codepoints occur sequentially, they form a single visible character, the Antarctic flag. Does Postgres consider this as 1 character for a VARCHAR, or as 2?
From some testing, it appears that to Postgres a UTF-8 "character" is a codepoint. Trying to insert 😀 into a VARCHAR(1) works; trying to insert 🇦🇶 does not. Is this how it works, or is it more complicated?
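Roughly, the test looked like this (the table name is just a placeholder, and it assumes a UTF-8 encoded database):

```sql
-- Minimal reproduction; assumes a UTF-8 encoded database.
CREATE TEMP TABLE varchar_test (val VARCHAR(1));

-- 😀 is a single code point (4 bytes in UTF-8): this insert succeeds.
INSERT INTO varchar_test VALUES ('😀');

-- 🇦🇶 is two code points (U+1F1E6 followed by U+1F1F6): this insert
-- fails with an error like "value too long for type character varying(1)".
INSERT INTO varchar_test VALUES ('🇦🇶');
```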
You can use the length() or, equivalently, the char_length() function to measure the length of a string. These functions count the number of code points in the string, and that is also what is limited by the type modifier of character varying. PostgreSQL counts the length of a character composed of multiple code points (such as a combining sequence, or the flag above) as the number of code points involved.
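For example, a quick check with the values from the question (again assuming a UTF-8 encoded database):

```sql
-- char_length()/length() count code points; octet_length() counts bytes.
SELECT char_length('😀')  AS smiley_chars,  -- 1
       octet_length('😀') AS smiley_bytes,  -- 4
       char_length('🇦🇶')  AS flag_chars,    -- 2 (two regional indicator code points)
       octet_length('🇦🇶') AS flag_bytes;    -- 8
```

So if you want to verify client-side that a string fits in a VARCHAR(n), count its Unicode code points, not its bytes and not its grapheme clusters.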