Search code examples
sqlpostgresqlcollation

Mysterious Postgres string comparison result when string contains certain symbol characters


My Postgres server gives me very confusing result while I try to compare strings which contains slash '/' or question mark '?'. For example in psql I ran:

select ('/' < '1') as c1,
       ('/1' < '1') as c2,
       ('/////1' < '1') as c3,
       ('/1' < '2') as c4,
       ('/1' < '11') as c5;

The result was:

 c1 | c2 | c3 | c4 | c5
----+----+----+----+----
 t  | f  | f  | t  | t

So '/' is smaller than '1', but '/1' is larger than '1'. In fact '/1' is between '1' and '2' and so is '/////1'. This did not follow lexicographical order.

However, '/1' was (correctly) smaller than 11 which made me even more confused.

I wanted to see if '/1' is considered escaped. So I ran:

select length('/1');

and I got 2, which meant that postgres does treat '/1' as a two character string.

The same problem happens when I replace / with other symbols such as $ or ?.

If you have docker, this problem is easily reproducible by runing a postgres in a docker container:

docker run postgres:11
docker exec  -it `docker ps | grep postgres:11 | cut -d' ' -f 1` psql -U postgres

Then try the above SQLs. I tried postgres 10 image and the behavior is the same.

The same thing happens to a real SQL when I compare a VARCHAR column with a string literal. This problem is driving me crazy since I need to write the correct SQL to compare file paths, which apparently contain many '/' symbols.

I searched and did not find any document talking about this so this doesn't look like a postgres 'official feature'. What is the right way to write a comparison that follows the lexicographical order?

Thanks a lot in advance.


Solution

  • Postgres uses the operating system's collation (on Linux that would be the ones provided by glibc). So your results depend on the underlying operating system.

    You can force an ASCCI comparison by using the "C" collation (as I did in the above examples):

    select '/1' > '1' collate "C"
    

    which seems to work the same on all platforms. Alternatively you can specify an ICU collation that will also work the same across all platforms.


    You mentioned you want to compare file paths. One way to do that only on the "names" (ignoring the delimiter") is to convert the path to an array string_to_array(filepath, '/') and then e.g. use that array for sorting or comparison.