javascript, node.js, locale

localeCompare when testing strings sorted in en-us utf8


I want to use localeCompare to test strings being sorted via Postgres.

The collation that is being used is en_US.utf8

When I use localeCompare to check that the results are sorted, in either descending or ascending order, it gives me a result that disagrees with Postgres. What locale can I pass to localeCompare to handle this properly?

For example:

Descending: "negative outcome".localeCompare("a sollicitudin orci") = 1

Ascending: "amet lorem semper auctor.".localeCompare("a sollicitudin orci") = 1


Solution

  • Unfortunately, there are no parameters you can pass to localeCompare to make it match Postgres' en_US.UTF-8 sorting.

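    Postgres does not implement this ordering itself. By default it delegates to the operating system's locale library (glibc on Linux), whose en_US.UTF-8 rules behave like an implementation of the Unicode Collation Algorithm, which is documented here: http://www.unicode.org/reports/tr10/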

    In contrast, localeCompare uses the "CompareStrings" operation from the Intl.Collator object. According to the spec, "The two Strings are compared in an implementation-defined fashion." (https://www.ecma-international.org/ecma-402/1.0/#CompareStrings). The spec suggests that implementations use the Unicode Collation Algorithm, but that is only a recommendation. I'm not sure exactly what each browser does, but I've done enough empirical testing in Chrome on Mac to be fairly sure that its behavior is very different from Postgres' implementation.
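
    A minimal sketch of that relationship, using one pair of strings from the question. It assumes an ICU-backed engine such as Node.js or Chrome; the variable names are only illustrative. The last lines try the standard `ignorePunctuation` collator option, which treats spaces and punctuation as ignorable. That changes the result for this pair, but it is not claimed to make the two collations equivalent in general.

    ```javascript
    // Strings taken from the question.
    const a = "amet lorem semper auctor.";
    const b = "a sollicitudin orci";

    // localeCompare and Intl.Collator#compare perform the same comparison,
    // so the sign of the result should agree.
    const collator = new Intl.Collator("en-US");
    const viaLocaleCompare = Math.sign(a.localeCompare(b, "en-US"));
    const viaCollator = Math.sign(collator.compare(a, b));
    console.log(viaLocaleCompare, viaCollator); // the question reports 1 here

    // Assumption: ignoring spaces and punctuation moves the result closer to
    // what Postgres does for this pair. It is not an exact match in general.
    const loose = new Intl.Collator("en-US", { ignorePunctuation: true });
    const viaLoose = Math.sign(loose.compare(a, b));
    console.log(viaLoose);
    ```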

    I'm currently not aware of any libraries that port the Unicode Collation Algorithm to JavaScript.

    So if you absolutely need a browser-side algorithm that exactly matches Postgres' sorting, and this is life or death, then I think your only option is to work from the spec (http://www.unicode.org/reports/tr10/), and possibly Postgres' source code, and port it to JavaScript.

    The spec is extremely dense and complex, so the pragmatic approach is probably to develop a good-enough algorithm that matches Postgres most of the time, and have your application handle the corner cases gracefully. The most helpful resource I've found for that is this answer, https://stackoverflow.com/a/3266430/534086, which provides a simple implementation of the algorithm using the Latin1 collation tables and could likely be adapted to use UTF8.

    For my purposes, I haven't gone that route yet. I wrote a much rougher algorithm that: a) first strips out special characters such as spaces and ampersands from the two strings, and compares them using localeCompare with 'en-US', and b) to break ties, compares the original strings using localeCompare. This is extremely rough (I have a few test cases that I know it does not work on), but in practice it seems to yield the same results as Postgres for at least 90% of my real world usage.
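
    The two steps above can be sketched roughly as follows. This is not the exact code I use: the function name `roughPostgresCompare` is made up for illustration, "special characters" is assumed to mean anything that is not a letter or a digit, and passing 'en-US' to the tie-break comparison is also an assumption.

    ```javascript
    // Rough approximation of en_US.utf8 ordering, NOT an exact match.
    // Assumption: "special characters" = everything except letters and digits.
    function stripSpecial(s) {
      return s.replace(/[^\p{L}\p{N}]/gu, "");
    }

    function roughPostgresCompare(a, b) {
      // a) Compare with spaces, ampersands, punctuation etc. removed.
      const primary = stripSpecial(a).localeCompare(stripSpecial(b), "en-US");
      if (primary !== 0) return primary;
      // b) Break ties by comparing the original strings.
      return a.localeCompare(b, "en-US");
    }

    // Usage with the strings from the question: sort ascending.
    const rows = [
      "negative outcome",
      "a sollicitudin orci",
      "amet lorem semper auctor.",
    ];
    const sorted = [...rows].sort(roughPostgresCompare);
    console.log(sorted);
    ```

    With this comparator "amet lorem semper auctor." sorts before "a sollicitudin orci", because the space no longer takes part in the first comparison. A plain localeCompare puts them the other way round.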