Tags: linux, sorting, csv, gnu-sort

Trying to understand the sort utility in Linux


I have a file named a.csv, which contains:

100008,3
10000,3
100010,5
100010,4
10001,6
100021,7

After running this command sort -k1 -d -t "," a.csv

The result is

10000,3
100008,3
100010,4
100010,5
10001,6
100021,7

This is unexpected, because 10001 should come before 100010.

I have been trying to understand why this happens for a long time, but couldn't find an answer.

$ sort --version
sort (GNU coreutils) 8.13
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and Paul Eggert.

Solution

  • Some of the other responses have assumed this is a numeric sort vs dictionary sort problem. It isn't: even under alphabetical ordering, the output given in the question is incorrect.

    The answer

    To get the correct sorting, you need to change -k1 to -k1,1:

    $ sort -k1,1 -d -t "," a.csv
    10000,3
    100008,3
    10001,6
    100010,4
    100010,5
    100021,7
    

    The reason

    The -k option takes two numbers, the start and end fields of the sort key (i.e. -ks,e where s is the start field and e is the end field). If the end field is omitted, the key runs to the end of the line. Hence, -k1 makes the whole line the key, which is the same as not giving the -k option at all. To show this, compare:

    $ printf "1,a,1\n2,aa,2\n" | sort -k2 -t,
    1,a,1
    2,aa,2
    

    with:

    $ printf "1~a~1\n2~aa~2\n" | sort -k2 -t~
    2~aa~2
    1~a~1
    

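    The first sorts the key a,1 before aa,2, while the second sorts aa~2 before a~1 because, in ASCII, the comma sorts before a and the tilde sorts after it (, < a < ~). The separator is part of the key, so it affects the comparison. This assumes byte-order comparison; other locales may weigh punctuation differently.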
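
    The same effect appears to explain the output in the question. With -k1 the key is the whole line, and -d considers only blanks and alphanumeric characters, so 10001,6 is compared as 100016 and 100010,5 as 1000105. The sketch below reproduces both orderings. It assumes GNU sort and forces byte-order collation with LC_ALL=C, which may not match the asker's locale; the temporary file is a stand-in for a.csv.

```shell
# Stand-in for the question's a.csv, written to a temporary file.
csv=$(mktemp)
printf '%s\n' 100008,3 10000,3 100010,5 100010,4 10001,6 100021,7 > "$csv"

# -k1: the key runs from field 1 to the end of the line.
# -d then ignores the comma, so "10001,6" is compared as "100016".
whole_line=$(LC_ALL=C sort -t, -d -k1 "$csv")

# -k1,1: the key is field 1 only.
first_field=$(LC_ALL=C sort -t, -d -k1,1 "$csv")

printf '%s\n\n' "$whole_line"    # ordering reported in the question
printf '%s\n' "$first_field"     # ordering the asker expected
rm -f "$csv"
```

    GNU sort also accepts --debug, which marks the part of each line used as the key and can help confirm how far a key extends.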

    To get the desired behaviour, therefore, we need to sort only one field. In your case, that means using 1 as both the start and end field, so you specify -k1,1. If you try the two examples above with -k2,2 instead of -k2, you'll find you get the same (correct) ordering in both cases.
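
    The claim about -k2,2 can be checked with a short script. This is a minimal sketch that assumes byte-order collation (LC_ALL=C is added here so the result does not depend on the local locale); the variable names are only for illustration.

```shell
# Restrict the key to field 2 only, for both separators.
comma=$(printf '1,a,1\n2,aa,2\n' | LC_ALL=C sort -t, -k2,2)
tilde=$(printf '1~a~1\n2~aa~2\n' | LC_ALL=C sort -t'~' -k2,2)

# Both orderings now put the "a" line before the "aa" line,
# because the separator is no longer part of the key.
printf '%s\n\n' "$comma"
printf '%s\n' "$tilde"
```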

    Many thanks to Eric and Assaf from the coreutils mailing list for pointing this out.