Search code examples
javascriptarraysalgorithmsortinglocale

Locale based sort in Javascript, sort accented letters and other variants in a predefined way


In Finnish we sort W after V (as in English) but because W is not a native Finnish letter, it is considered as a variant of V, which is sorted as it were equal to V, but in cases where the only difference between two words is that V is W, then V-version is sorted first. An example enlights the proper order:

Vatanen, Watanen, Virtanen

In Finnish V and W are collated as A and Á. Á is sorted like A, but in cases where it is the only difference, the unaccented one comes first. The same rule is for all other accented letters, but the Å, Ä and Ö are collated separately after Z.

Question: What would be the best algorithm to sort this like variants in a predefined way? ( eg. [Watanen, Vatanen, Virtanen] to [Vatanen, Watanen, Virtanen] )?

Addition: The question is relevant to extend to cover also other variants in the way they are defined in http://cldr.unicode.org/index/cldr-spec/collation-guidelines, because the technique would in a great probability be the same and the answers to this question benefit the widest possible audience and sort algorithms can be made compatible with collation rules defined in Unicode CLDR. The Unicode CLDR defines three levels of differences between letters: primary level (base letters), secondary level (accented letters) and tertiary level (character case).

I have thought some kind of array preparation like in numerical sort where we could pad all numbers with zeros to make them comparaple as strings. An example: Array [file1000.jpg, file3.jpg, file22.jpg] can be prepared to make it comparable as strings by padding with zeros this way: [file1000.jpg, file0003.jpg, file0022.jpg]. Because of preparation of array, we can sort it very fast using native Array.sort().

The target language is Javascript, which lacks support for collation based sorts, so the custom sort function have to be made self. The algorithm is preferred, but if you have also code it's worth +1.


Solution

  • The usual approach to this problem is to use a list of mappings (normally the list doesn't need to be longer than three, and in your case two would do.) Each mapping maps a character onto a sequence point. [Note 3] So in your example,

     primary:      secondary:
      A -> 0         A -> 0
      Á -> 0         Á -> 1
      B -> 1         (irrelevant)
      C -> 2
      D -> 3
      E -> 4
    ...
      T -> 20
      U -> 21
      V -> 22        V -> 0
      W -> 22        W -> 1
      X -> 23
    ...
    

    The comparison algorithm essentially first translates each character in the words to using mapping1, and if they are not the same, uses that as the comparison. If they are the same, then it repeats using mapping2 (and so on).

    Not all languages are so simple, so there are a bunch of variations (for example, you might reverse the strings in pass 2).

    Note that you can achieve the same effect by making comparison keys consisting of the concatenation of the translations. If you do a lot of comparisons, caching this key can be a win. In that case, you would use a special value in the mappings other than the first mapping for "irrelevant". All irrelevant codes can be omitted, which often shortens the comparison key quite a bit.

    For example, in your example (but just upper case because it would be tedious to type the whole mapping sequence), we would translate VATANEN using the first mapping to [22, 1, 20, 1, 15, 5, 15] and with the second mapping to [0, 0, --, 0, --, --, --]. WATANEN would be [22, 1, 20, 1, 15, 5, 15] (exactly the same) with the first mapping, and [1, 0, --, 0, --, --, --] with the second. So dropping the --'s [Note 1], the comparison keys would be:

    VATANEN:  [22, 1, 20, 1, 15, 5, 15, 0, 0, 0]
    VÁTANEN:  [22, 1, 20, 1, 15, 5, 15, 0, 1, 0] (if there were such a place)
    WATANEN:  [22, 1, 20, 1, 15, 5, 15, 1, 0, 0]
    VIRTANEN: [22, 9, ...]
    

    This can be extended to more than two translation tables.

    For example, many applications want to do something like case-insensitive sorting, but where character case makes a difference if there are no other differences (in English, this usually means putting words with upper-case letter before words which are all lower-case, but both options are plausible.)

    So in the Finnish case, we could add a third translation table, where all upper case letters are translated to 0, all lower case letters are translated to 1, and all other characters are not translated. Some concatenated translations:

               -------primary---------  --2ary-  ------tertiary-----
    VÁTANEN:  [22, 1, 20, 1, 15, 5, 15, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    Vátenen:  [22, 1, 20, 1, 15, 5, 15, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1]
    WATANEN:  [22, 1, 20, 1, 15, 5, 15, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    

    It is not at all obvious that this order is "correct". Nor, indeed, is it obvious what "correct" means for most languages, aside from those which have official linguistic authorities. [Note 2] So the above should just be considered as an example of multi-level encoding, not a definitive guide to alphabetic order. In this case, the tertiary code consists of just a single bit, although there may still be languages (like Dutch) in which there are three cases for a few letters.

    The above scheme does not contemplate digraphs and trigraphs, although they are reasonably easy to add, with care. (In the primary ordering, and possibly in secondary and tertiary as well, the digraph needs to have a single code for both characters.) Spanish, contrary to popular belief amongst non-Spanish programmers, has not been an example of this since 1994, almost twenty years ago, when the RAE decreed that 'ch' is alphabetized between 'cg' and 'ci', and not between 'c' and 'd' as it formerly was. I believe that some Dutch speakers still expect to find 'ij' and 'y' together, and Hungarians may still respect the complicated collection of digraphs and trigraphs that comprise their alphabet, but on the whole complicated mechanical schemes for alphabetical ordering are dying out, to be replaced with simple latin ordering, possibly complemented with secondary ordering for diacritics (French, Spanish, apparently Finnish, German dictionaries but not telephone books) or primary ordering of diacritics (Spanish ñ, Danish/Norwegian/Swedish vowels, Turkish).


    [Note 1]: It's not necessary to insert "irrelevant" secondary codes because the secondary part of the encoding is only consulted for pairs of words where the primary parts are identical. Since any letter considered irrelevant for the secondary coding would be so considered in all words in a primary equivalence class, it can just be omitted from the secondary coding. Similarly, it is legitimate to reuse codes in different primary equivalence classes, as we do above: [v, w] are [0, 1] and so are [a, á]. Obviously, there is no possibility of ambiguity. Consequently, secondary encodings can be quite short, both in sequence length and bit length.

    [Note 2]: English has no such body; the Spanish one is the Real Academia Española, but I couldn't find precise collation rules in any of the RAE's publications on my bookshelf, aside from the laconic observation that accents aren't considered in alphabetical order. However, the RAE's dictionary seems to consistently put unaccented words before any accented word with the same letters, at least in the two cases I could think of -- papa/papá and sabana/sábana.

    [Note 3] Of course, we need to keep track of the originals as well, so we have to somehow attach the comparison keys to the strings. As long as no two characters have the same translation in all of the mappings, this can be done with a simple hash table using the comparison key as a key.