In our project we use full-text search feature of PostgreSQL for different languages (English, German, Greek, Ukrainian, etc) using simple
dictionary. On my local machine (macOS Mojave) I have PostgreSQL installed using Homebrew. After trying to upgrade PostgreSQL from version 9.5.23 to 11.10 (basically I just removed old version and installed it anew) I noticed that when I try to use to_tsvector
function to convert my text to tsvector
data type it stopped lower-casing the text when using non-Latin languages. It still lower-cases the text producing lower-cased lexemes if it's Latin-based (ASCII). The problem only appeared though after I had upgraded to version 11 (it still worked as expected after I had transitionally upgraded from 9.5.23 to 10.15).
Examples (on versions 11.10, 12.5, 13.1):
SELECT to_tsvector('simple', 'Офісний Менеджер');
'Менеджер':2 'Офісний':1
SELECT to_tsvector('simple', 'Ελληνικά');
'Ελληνικά':1
SELECT ts_lexize('simple', 'Менеджер');
{Менеджер}
SELECT ts_lexize('simple', 'Manager');
{manager}
This is how it was previously (on versions 9.5.23 and 10.15):
SELECT to_tsvector('simple', 'Офісний Менеджер');
'менеджер':2 'офісний':1
SELECT to_tsvector('simple', 'Ελληνικά')
'ελληνικά':1
SELECT ts_lexize('simple', 'Менеджер');
{менеджер}
SELECT ts_lexize('simple', 'Manager');
{manager}
I tried reading all the release notes but I couldn't see anything related to my problem. In the docs it is still stated that
The simple dictionary template operates by converting the input token to lower case...
so I assume there were no changes to the API itself.
I have a suspicion that in version 11 they changed some default configuration parameters which affect this but I couldn't find anything helpful.
Running \l
in psql shell gives me the following.
# 11.10
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
--------------------+-----------------+----------+---------+-------+-------------------------------------
postgres | dmytrosavochkin | UTF8 | C | C |
template0 | dmytrosavochkin | UTF8 | C | C | =c/dmytrosavochkin +
| | | | | dmytrosavochkin=CTc/dmytrosavochkin
template1 | dmytrosavochkin | UTF8 | C | C | =c/dmytrosavochkin +
| | | | |
# 9.5.23
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
--------------------+-----------------+----------+-------------+-------------+-------------------------------------
postgres | dmytrosavochkin | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
template0 | dmytrosavochkin | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/dmytrosavochkin +
| | | | | dmytrosavochkin=CTc/dmytrosavochkin
template1 | dmytrosavochkin | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/dmytrosavochkin +
| | | | | dmytrosavochkin=CTc/dmytrosavochkin
Okay, thanks to the comment of @jjanes I got it (thank you!).
The problem happened to be in the set up of Homebrew postgresql formulae. Basically, when I was installing [email protected] it was running initdb /usr/local/var/[email protected]
under the hood to initialize the cluster but when I switched to 11.10 it was running initdb --locale=C -E UTF-8 /usr/local/var/postgresql@10
setting C
as a default locale and I didn't notice it.
To fix this I removed the initial data by rm -rf /usr/local/var/postgresql@12
and then re-initialized it manually with initdb --locale=en_US.UTF-8 -E UTF-8 /usr/local/var/postgresql@12
. Now the default collate for all databases is en_US.UTF-8 and everything works exactly as before.