Search code examples
postgresqldatabase-designlocale

Should I save ASCII-only varchar in UTF-8 or ASCII?


I have a varchar column that contains only ASCII symbols. I don't need to sort by this field, but I need to search it by full equality.

Default locale is en.UTF8. Will I gain anything if I create this column with collate "C"?


Solution

  • Yes, it makes a difference.

    Even if you do not sort deliberately, there are various operations requiring sort steps internally (some aggregate functions, DISTINCT, nested loop joins etc.).

    Also, any index on the field has to sort values internally - and observe collation rules unless COLLATE "C" applies (no collation rules).

    For searches by full equality you'll want an index - which works either way (for equality), but it's faster overall without collation rules. Depending on the details of your use case, the effect may be negligible or substantial. The impact grows with the length of your strings. I ran a benchmark on a related case some time ago:

    Also, there are more pattern matching options with locale "C". The alternative would be to create an index with the special varchar_pattern_ops operator class.

    Related:

    Postgres 9.5 introduced performance improvements with a technique called "abbreviated keys", which ran into problems with some locales. So it was deactivated, except for the C locale. Quoting The release notes of Postgres 9.5.2:

    • Disable abbreviated keys for string sorting in non-C locales (Robert Haas)

    PostgreSQL 9.5 introduced logic for speeding up comparisons of string data types by using the standard C library function strxfrm() as a substitute for strcoll(). It now emerges that most versions of glibc (Linux's implementation of the C library) have buggy implementations of strxfrm() that, in some locales, can produce string comparison results that do not match strcoll(). Until this problem can be better characterized, disable the optimization in all non-C locales. (C locale is safe since it uses neither strcoll() nor strxfrm().)

    Unfortunately, this problem affects not only sorting but also entry ordering in B-tree indexes, which means that B-tree indexes on text, varchar, or char columns may now be corrupt if they sort according to an affected locale and were built or modified under PostgreSQL 9.5.0 or 9.5.1. Users should REINDEX indexes that might be affected.

    It is not possible at this time to give an exhaustive list of known-affected locales. C locale is known safe, and there is no evidence of trouble in English-based locales such as en_US, but some other popular locales such as de_DE are affected in most glibc versions.

    The problem also illustrates where collation rules come in, generally.