
Letter-only collation (was: Weird file ordering in Emacs dired with my locale)


I just noticed. And this is creepy. But here's my screenshot. So help me, maybe!

TL;DR

The question's at the bottom.

Symptom

  -rw-r--r--  1 jb jb  24287 mars  21  2012 array.c
  -rw-r--r--  1 jb jb  28767 oct.   1  2014 arrayfunc.c
  -rw-r--r--  1 jb jb   2895 mai   11  2012 arrayfunc.h
  -rw-rw-r--  1 jb jb   4030 mars  29  2009 array.h
-UUU:%%--F1  bash-4.3.30          6% L9     (Dired by name)---------------------
 

(This is an emacs -nw screenshot. Yes, my terminal is 6 lines tall. It makes the screenshots more to-the-point. The locale is French, and that's expected. It's not that different from English: just imagine there's a “may” instead of « mai », and that the months are Capitalized and truncated to three characters.)

In case you missed it, it's dired mode, the files are supposed to be sorted by name (says so in the modeline) yet array.c and array.h aren't together!

Panic

I was looking for array.c, had the cursor beneath so whoa dude where is it it was there a minute ago. Then I actually find it. Then I check the modeline. Then I go WTF I'm asking SO. Then I notice it's in French they'll never understand better take a new screenshot with LC_ALL=C.

But that fixed the problem.

(Yes, it really happened.)

So it's a locale thing

My locale is fr_FR.UTF-8

     $ ls ar*           |       $ LC_ALL=C ls ar*
     array.c            |       array.c          
     arrayfunc.c        |       array.h          
     arrayfunc.h        |       arrayfunc.c      
     array.h            |       arrayfunc.h      

(That's when I remove the tag and start wondering if anyone actually follows seriously)

Seems it's the norm

I'll spare you the arcane shell invocations, but the gist of it: of the 29 locales I've got installed here, all but three use the “weird” ordering. Those three are: C, C.UTF-8 and POSIX.

It goes without saying, but there's no harm in mentioning it anyway: the “weird” ordering disturbs me, but it makes sense in its own way. On this small sample set it orders lexicographically as usual, only ignoring the period. So arrayc < arrayf < arrayh.
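A minimal sketch of that observation in Python, assuming the rule really is "compare as usual, but skip anything that isn't a letter or digit". The helper name `letters_only` is made up for illustration; this is not how the C library implements collation, it only reproduces the two `ls` columns above.

```python
FILES = ["array.c", "arrayfunc.c", "arrayfunc.h", "array.h"]

def letters_only(name):
    """Hypothetical sort key: drop everything that is not a letter or digit."""
    return "".join(ch for ch in name if ch.isalnum())

# Plain code-point comparison, which is what LC_ALL=C does for ASCII names.
c_order = sorted(FILES)

# Same comparison, but on the names with punctuation removed.
weird_order = sorted(FILES, key=letters_only)

print(c_order)
print(weird_order)
```

The first list matches the `LC_ALL=C ls` column and the second matches the fr_FR.UTF-8 column, which is at least consistent with the "ignores the period" guess.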

Question

Why? WHY? WHY??? It's in every locale but C, so it's deliberate. What rule is this based on? Did someone in some committee erect and convict: “thou shalt not observe thy punctuation whilst collating”? There's probably some legitimate serious document where they say it's perfectly normal, here's why, right?

It's the first time in oh so many years that I notice.

It also ignores spaces, of course.

Bonus: It's the bash-4.3.30 tarball from gnu.org. Why are some files 0664 and others 0644? Keep answers to that in the comments.
Also: I'm not asking how to fix it. In case you hadn't noticed, I don't really need to fix it. Plus, this has dupes everywhere. What I'm asking is why.


Solution

  • ANSWER: The Unicode Consortium came to the conclusion that having a guaranteed sort order, regardless of 'variable' characters, was more important than including every character in the string.

    DETAILS: I believe the answer you're looking for resides in:

    Unicode Technical Standard #10: Unicode Collation Algorithm

    If I'm understanding it correctly, punctuation (among other things, like whitespace) is 'variable' among languages. To ensure an identical sort order across languages, 'variable' characters are therefore given a very low 'weight' in sorting, frequently resolving to a weight of zero, and so have no effect on sorting at all.

    The UTS does indicate that the sorting can be customized per user.

    Unfortunately, most systems just go with the defaults. As a result, only a few collation definitions give 'variable' characters equal weight, and there is no real support for users to tune the defaults so that they get UTF-8 sorting with punctuation and whitespace INCLUDED instead of EXCLUDED.
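    To make the 'weight' idea concrete, here is a toy model in Python. It is only a sketch: the `VARIABLE` set, the use of code points as weights, and the function name `toy_key` are all assumptions made up for illustration, and real collation tables are far richer. The option names "shifted" and "non-ignorable" are borrowed from the variable-weighting options described in UTS #10.

```python
# Toy stand-in for the 'variable' characters; real tables define this set.
VARIABLE = set(" .,-_'")

def toy_key(s, variable="shifted"):
    """Hypothetical two-level sort key illustrating variable weighting."""
    primary = []
    for ch in s.casefold():
        if ch in VARIABLE and variable == "shifted":
            # Weight zero at the primary level: the character contributes nothing.
            continue
        # Toy primary weight: just the code point.
        primary.append(ord(ch))
    # The raw string is a last-resort tie-breaker, so the order stays deterministic.
    return (primary, s)

files = ["array.c", "arrayfunc.c", "arrayfunc.h", "array.h"]

shifted = sorted(files, key=toy_key)
non_ignorable = sorted(files, key=lambda s: toy_key(s, "non-ignorable"))

# Strings that differ only in punctuation tie at the primary level.
ties = sorted(["ab", "a.b"], key=toy_key)

print(shifted)
print(non_ignorable)
print(ties)
```

    With "shifted" the period drops out and array.h lands after arrayfunc.h; with "non-ignorable" the period counts again and the result is the same as the C ordering. That switch is the kind of per-user tuning the standard allows for, even if most systems don't expose it.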

    If I follow the rationale correctly, consider sorting names. In many cultures and languages, firstname is always given before lastname, and when the order is reversed, the lastname is separated from the firstname by punctuation. In other cultures, the reverse is true.

    lastname, firstname
    lastname firstname
    

    and

    firstname lastname
    firstname, lastname
    

    To ensure that each list is always sorted in the same order, the punctuation is ignored.
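    One way to read that argument, sketched in Python. The names are invented and `letters_only` is a hypothetical helper, not anything taken from the standard; the point is only that the same two people can come out in a different order depending on the separator, unless punctuation and spaces are ignored.

```python
def letters_only(name):
    """Hypothetical sort key: case-fold, then keep only letters and digits."""
    return "".join(ch for ch in name.casefold() if ch.isalnum())

# The same two (made-up) people, written in two conventions.
comma_style = ["Van Zant, Al", "Van, Mo"]
space_style = ["Van Zant Al", "Van Mo"]

# Code-point order: space sorts before comma, so the separator decides.
raw_comma = sorted(comma_style)
raw_space = sorted(space_style)

# Letters-only order: the separator no longer matters.
key_comma = sorted(comma_style, key=letters_only)
key_space = sorted(space_style, key=letters_only)

print(raw_comma, raw_space)
print(key_comma, key_space)
```

    In code-point order, Al comes first in the comma list but Mo comes first in the space list. With the letters-only key, Mo comes first in both.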