I have a file such as
day1 aargh
day2 boom
day3 crack
day2 aargh
and I want to sort it according to the first key, but not any other keys; that is, I want to preserve the order of lines where the key is the same.
I expected it would be as simple as
$ sort -k1,1 myfile
day1 aargh
day2 aargh
day2 boom
day3 crack
But whoops: as you can see, sort put the original line 4 before line 2 for no apparent reason, throwing away the original order. (On day 2, "boom" came before "aargh", not the other way around. There were no two "aargh"s without a "boom" in between! :))
What I wanted was:
$ sort -k1,1 myfile
day1 aargh
day2 boom
day2 aargh
day3 crack
Why is that? Is that a bug? And more importantly, how do I make sort behave the way I want?
You need to use this option:
-s, --stable
stabilize sort by disabling last-resort comparison
The last-resort comparison is a stringwise comparison of the entire line, used if all the keys are equal.
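To see the difference on the sample data from the question, here is a minimal sketch. It pipes the four lines in with printf instead of reading myfile, and sets LC_ALL=C (my assumption, not something the question used) so the byte ordering is predictable:

```shell
# The four lines from the question, in their original order.
input=$(printf '%s\n' 'day1 aargh' 'day2 boom' 'day3 crack' 'day2 aargh')

# Without -s: ties on field 1 are broken by comparing the whole line,
# so "day2 aargh" moves ahead of "day2 boom".
unstable=$(printf '%s\n' "$input" | LC_ALL=C sort -k1,1)

# With -s: ties on field 1 keep their input order,
# so "day2 boom" stays ahead of "day2 aargh".
stable=$(printf '%s\n' "$input" | LC_ALL=C sort -s -k1,1)

printf 'without -s:\n%s\n\nwith -s:\n%s\n' "$unstable" "$stable"
```

The second listing should match the output the question asked for.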
And the next time you have trouble with sort (which you definitely will if you keep using it; there are many non-intuitive things in it), try using --debug to see what is being compared.
If you take just this line:
day2 aargh
and try sort --debug -k1,1 on it, you should get this:
day2 aargh
____
__________
The input line is shown with a row of underscores under day2. This means day2 is the highest-priority sort key for that line. It will be compared to the highest-priority sort key of the other lines to decide which one comes first. This key is included in the list of keys because of the -k1,1.
The next row of underscores is under the whole line. That means the next sort key for the line, in descending priority order, is the whole line. If the -k1,1 key is exactly the same in a pair of lines, this is what will be compared next. This key is included in the list of keys because of the absence of -s.
Try it again with -s -k1,1 --debug and you'll see the second row of underscores is gone.
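If you want to check that without eyeballing the underscores, one rough way is to count the lines --debug prints for a single input line. This is only a sketch and it assumes GNU sort (--debug is not in POSIX); the variable names are mine, and the diagnostic messages that GNU sort writes to stderr are discarded:

```shell
# GNU sort only: --debug annotates each line with one underscore row per key.
# Without -s there are two keys: field 1 and the last-resort whole line.
rows_default=$(printf 'day2 aargh\n' | LC_ALL=C sort --debug -k1,1 2>/dev/null | wc -l)

# With -s the last-resort key is disabled, so only the field 1 row remains.
rows_stable=$(printf 'day2 aargh\n' | LC_ALL=C sort --debug -s -k1,1 2>/dev/null | wc -l)

echo "lines printed without -s: $rows_default"
echo "lines printed with -s:    $rows_stable"
```

Each count is the input line itself plus its underscore rows, so the stable run should print one line fewer.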
I can't think of an example where sort -k1,1 would behave differently from sort with no options, since the whole-line comparison is going to start with the same bytes as the first-field comparison. But surely you can see that sort -k2,2 has a distinct meaning: first try the second field, then the whole line. So -k1,1 by itself is kind of a useless degenerate case.
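A small sketch of the -k2,2 case, using made-up input lines in the style of the question and LC_ALL=C (my assumption) for a predictable byte order:

```shell
# Hypothetical input: two lines share the same second field ("boom").
input=$(printf '%s\n' 'day3 boom' 'day1 boom' 'day2 aargh')

# Field 2 first, then the whole line as a tie-breaker:
# the two "boom" lines end up ordered day1, day3.
by_field2=$(printf '%s\n' "$input" | LC_ALL=C sort -k2,2)

# Field 2 only; ties keep their input order:
# the two "boom" lines stay ordered day3, day1.
by_field2_stable=$(printf '%s\n' "$input" | LC_ALL=C sort -s -k2,2)

printf 'sort -k2,2:\n%s\n\nsort -s -k2,2:\n%s\n' "$by_field2" "$by_field2_stable"
```

Here the whole-line fallback visibly reorders the tied lines by their first field, which is the part -s switches off.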
As for why... the default behavior of sort has included a last-resort whole-line comparison at least as far back as Version 6 UNIX; see the man page from 1975, which says
Lines that compare equal are ordered with all bytes significant.
(And there was no -s option to disable it either!)
The strange default behavior of sort is just a historical thing we have to live with, because something that old and widely used can't have its defaults changed. Be thankful for GNU's --debug option, a relatively late addition which arrived in 2010.