I have a varchar
column that contains only ASCII symbols. I don't need to sort by this field, but I need to search it by full equality.
Default locale is en.UTF8
. Will I gain anything if I create this column with collate "C"
?
Yes, it makes a difference.
Even if you do not sort deliberately, there are various operations requiring sort steps internally (some aggregate functions, DISTINCT
, nested loop joins etc.).
Also, any index on the field has to sort values internally - and observe collation rules unless COLLATE "C"
applies (no collation rules).
For searches by full equality you'll want an index - which works either way (for equality), but it's faster overall without collation rules. Depending on the details of your use case, the effect may be negligible or substantial. The impact grows with the length of your strings. I ran a benchmark on a related case some time ago:
Also, there are more pattern matching options with locale "C". The alternative would be to create an index with the special varchar_pattern_ops
operator class.
Related:
Postgres 9.5 introduced performance improvements with a technique called "abbreviated keys", which ran into problems with some locales. So it was deactivated, except for the C
locale. Quoting The release notes of Postgres 9.5.2:
- Disable abbreviated keys for string sorting in non-
C
locales (Robert Haas)PostgreSQL 9.5 introduced logic for speeding up comparisons of string data types by using the standard C library function
strxfrm()
as a substitute forstrcoll()
. It now emerges that most versions of glibc (Linux's implementation of the C library) have buggy implementations ofstrxfrm()
that, in some locales, can produce string comparison results that do not matchstrcoll()
. Until this problem can be better characterized, disable the optimization in all non-C locales. (C
locale is safe since it uses neitherstrcoll()
norstrxfrm()
.)Unfortunately, this problem affects not only sorting but also entry ordering in B-tree indexes, which means that B-tree indexes on
text
,varchar
, orchar
columns may now be corrupt if they sort according to an affected locale and were built or modified under PostgreSQL 9.5.0 or 9.5.1. Users shouldREINDEX
indexes that might be affected.It is not possible at this time to give an exhaustive list of known-affected locales.
C
locale is known safe, and there is no evidence of trouble in English-based locales such asen_US
, but some other popular locales such asde_DE
are affected in most glibc versions.
The problem also illustrates where collation rules come in, generally.