Tags: bash, sorting, multiple-columns, cut

Bug in bash sort with different columns?


I am working with a file that contains 3 values, an ID (they happen to be protein ids in case you are curious), a value, and then another value. It is tab delimited, so it looks like this:

A2M     0.979569315988908       1
AACS    0.925340159491081       1
AAGAB   0.982296215686199       1
AAK1    0.736903840140103       1
AAMP    0.00589711816127862     0.138868449447202
AARS2   1       1
AARS    3.13300124295614e-05    0.00212792325492566
AARSD1  0.527417792161261       1
AASDH   0.869909252023668       1
AASDHPPT        0.763918221284724       1
AATF    0.691907759125663       1
ABAT    0.989693691462661       1
ABCA1   0.601194017450064       1
ABCA5   1       1
ABCA6   1       1

I am interested in sorting these IDs in alphabetical order and extracting various values. However, I noticed that sort sorts the IDs differently, depending on what I am extracting. When I execute:

    cut --fields=1,2 input.txt|sort --key=1

The resulting file is:

A2M     0.979569315988908
AACS    0.925340159491081
AAGAB   0.982296215686199
AAK1    0.736903840140103
AAMP    0.00589711816127862
AARS2   1
AARS    3.13300124295614e-05 
AARSD1  0.527417792161261
AASDH   0.869909252023668
AASDHPPT        0.763918221284724
AATF    0.691907759125663
ABAT    0.989693691462661
ABCA1   0.601194017450064
ABCA5   1
ABCA6   1

But when I execute:

    cut --fields=1,3 input.txt|sort --key=1

I get:

A2M     1
AACS    1
AAGAB   1
AAK1    1
AAMP    0.138868449447202
AARS    0.00212792325492566
AARS2   1
AARSD1  1
AASDH   1
AASDHPPT        1
AATF    1
ABAT    1
ABCA1   1
ABCA5   1
ABCA6   1

Notice that the positions of AARS and AARS2 are switched, which they shouldn't be since I am just sorting based on the first column. I've never seen any behavior like this from sort, and I've been using bash for a while now. Is this a bug, or am I doing something wrong?


Solution

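  • The --key=1 option tells sort to use all "fields" from the first through the end of the line to sort the input. As @rici observed first, by default this is a locale-sensitive sort, and in many locales whitespace is ignored for collation purposes. That's what seems to be happening here: with the tab ignored, the first pipeline effectively compares "AARS21" with "AARS3.13300124295614e-05", which puts AARS2 first, while the second compares "AARS21" with "AARS0.00212792325492566", which puts AARS first.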

    If you want to sort only on the protein IDs, end the key at the first field as well:

    cut --fields=1,2 input.txt | sort --key=1,1
    cut --fields=1,3 input.txt | sort --key=1,1
    

    @rici explains how to approach the problem by specifying a collation order that accounts for whitespace.
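
Both approaches can be tried on the two rows that trade places. The following is a minimal sketch, not a definitive recipe: the rows() helper and the variable names are made up for illustration, the short options -f, -t and -k stand in for --fields, --field-separator and --key, and LC_ALL=C is only one assumed way of "specifying a collation order that accounts for whitespace" (the source does not quote @rici's exact command).

```shell
#!/bin/sh
# rows(): hypothetical helper that emits the two problem rows, tab-separated.
rows() {
  printf 'AARS2\t1\t1\n'
  printf 'AARS\t3.13300124295614e-05\t0.00212792325492566\n'
}

# A literal tab, so the field separator can be named explicitly.
tab=$(printf '\t')

# Approach 1: end the key at field 1, so only the ID is compared.
key_12=$(rows | cut -f1,2 | sort -t "$tab" -k1,1)
key_13=$(rows | cut -f1,3 | sort -t "$tab" -k1,1)

# Approach 2 (assumption): byte-wise collation via LC_ALL=C, where the
# tab is a real character that sorts before any digit or letter.
c_12=$(rows | cut -f1,2 | LC_ALL=C sort)
c_13=$(rows | cut -f1,3 | LC_ALL=C sort)

printf '%s\n\n' "$key_12" "$key_13" "$c_12" "$c_13"
```

In all four results AARS should come before AARS2, whichever value column was kept. Note that LC_ALL=C also changes how letters compare (for example, all uppercase letters sort before lowercase ones), so restricting the key is the narrower fix when only the ID ordering matters.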