sorting, command-line, gnu-coreutils

Why does coreutils sort give a different result when I use a different field delimiter?


When using sort on the command line, why does the sorted order depend on which field delimiter I use? As an example,

$ # The test file:
$ cat test.csv
2,az,a,2
3,a,az,3
1,az,az,1
4,a,a,4

$ # sort based on fields 2 and 3, comma separated.  Gives correct order.  
$ LC_ALL=C sort -t, -k2,3 test.csv 
4,a,a,4
3,a,az,3
2,az,a,2
1,az,az,1

$ # replace , by ~ as field separator, then sort as before.  Gives incorrect order.
$ tr "," "~" < test.csv | LC_ALL=C sort -t"~" -k2,3
2~az~a~2
1~az~az~1
4~a~a~4
3~a~az~3

The second case not only gets the ordering wrong, but is inconsistent between field 2 (where az < a) and field 3 (where a < az).


Solution

  • The mistake is in -k2,3. That key starts at the beginning of the 2nd field and ends at the end of the 3rd field, so sort compares the whole span as a single string. The delimiter between the two fields is part of that string and is compared like any other character. In the C locale, "," (0x2C) sorts before every letter, while "~" (0x7E) sorts after every letter. That's why you get different orders with different delimiters.

    What you want is the following:

    LC_ALL=C sort -t"," -k2,2 -k3,3 file
    

    And:

    tr "," "~" < file | LC_ALL=C sort -t"~" -k2,2 -k3,3
    

    That means sort should sort by the 2nd field and, if the 2nd field has duplicates, break the tie using the 3rd field.
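
To see the effect of the delimiter directly, here is a minimal sketch that approximates the key -k2,3 compares by cutting out fields 2 through 3 (delimiter included) and sorting those strings bytewise. The variable names (data, comma_keys, tilde_keys, fixed) are only illustrative, and cut is used here as a stand-in for sort's own key extraction, not as a description of how sort is implemented.

```shell
#!/bin/sh
# The rows from test.csv in the question, held in a variable for convenience.
data=$(printf '%s\n' '2,az,a,2' '3,a,az,3' '1,az,az,1' '4,a,a,4')

# Approximate the -k2,3 key with ',' as delimiter.
# ',' is byte 0x2C, which is lower than any letter, so "a,az" < "az,a".
comma_keys=$(printf '%s\n' "$data" | cut -d, -f2-3 | LC_ALL=C sort)

# Same key with '~' as delimiter.
# '~' is byte 0x7E, which is higher than 'z' (0x7A), so "az~a" < "a~a".
tilde_keys=$(printf '%s\n' "$data" | tr ',' '~' | cut -d'~' -f2-3 | LC_ALL=C sort)

# With one key per field, the delimiter never takes part in a comparison.
fixed=$(printf '%s\n' "$data" | tr ',' '~' | LC_ALL=C sort -t'~' -k2,2 -k3,3)

printf 'comma keys:\n%s\n\n' "$comma_keys"
printf 'tilde keys:\n%s\n\n' "$tilde_keys"
printf 'tilde input, -k2,2 -k3,3:\n%s\n' "$fixed"
```

The tilde keys come out in the same order as the "incorrect" output in the question, while the last command gives the expected order regardless of the delimiter. GNU sort also has a --debug option that annotates the part of each line used as the key, which can help when checking a key specification.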