bash macos sorting tr numerical

Numerical sorting but keeping duplicates


I'm numerically sorting a file with almost nine hundred lines. For others, the commands

tr '\r' '\n' < myfile.txt | sort

or

tr '\r' '\n' < myfile.txt | sort -n

seem to do the trick, but I don't get the output I want (only two hundred lines). All duplicate numbers are lost on my Mac, and I get the terminal error "tr: illegal byte sequence".

What am I doing wrong, and why can't I figure out how to save the file? Could it have something to do with the file having blank columns?

The file is here: dropbox.com/s/umzx64c5ix90l3y/Proteins.txt?dl=0

EDIT/CLARIFICATION:

Once I've sorted all the lines numerically, I need to combine the lines that share the same number, so that new information is added to the upper line. Take for instance the lines with number 61:

Col   1     2      3     4       5       6         7         8      9   10    11
     61 PTS...  cyt   1bl..   0,38  MONOMER homo-trimer FRUC... PER...Bac..
     61 PTS...                                                                 3

becomes:

Col   1     2      3     4       5       6         7         8      9   10    11
     61 PTS...  cyt   1bl..   0,38  MONOMER homo-trimer FRUC... PER...Bac..   3

If info in the two lines overlaps, I need the information from the upper line to be kept.
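To make the intended merge rule concrete, here is a minimal awk sketch. It assumes the columns are tab-separated and that the first column is the number; the real file may be laid out differently. The file names `rows.tsv` and `merged.tsv` and the sample rows are made up for illustration.

```shell
# Hypothetical sample: two rows share the number 61; columns are tab-separated.
printf '61\tPTS\tcyt\t\n61\tXXX\t\t3\n62\tabc\tmem\t5\n' > rows.tsv

awk -F '\t' -v OFS='\t' '
{
    key = $1
    if (!(key in seen)) {          # first row seen for this number
        seen[key] = 1
        order[++n] = key           # remember input order for the output
    }
    if (NF > width[key]) width[key] = NF
    for (i = 1; i <= NF; i++)      # fill only cells that are still empty,
        if (cell[key, i] == "" && $i != "")
            cell[key, i] = $i      # so the upper row wins on overlap
}
END {
    for (k = 1; k <= n; k++) {
        key = order[k]
        line = cell[key, 1]
        for (i = 2; i <= width[key]; i++)
            line = line OFS cell[key, i]
        print line
    }
}' rows.tsv > merged.tsv

cat merged.tsv
```

With this sample, the two rows for 61 collapse into one: `PTS` and `cyt` come from the upper row, and the `3` from the lower row fills the empty last column.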

Thanks :)


Solution

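  • Your file is not properly UTF-8-encoded, while your locale is most certainly set to UTF-8. tr stops at the first byte sequence it cannot decode, which is probably why you only get about two hundred lines. Line 195 contains (invalid sequences are marked with <HEX>):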

    1945    comM    protection against fracitins/bacteriocins (found by comparison to spr genome according to H<CE>varstein 2006)   integral membrane protein (H<CE>varstein)   no model
    

    Figure out what the encoding is, then either convert the file to a proper encoding or change the locale to accommodate it. Simply trying

    env LC_ALL=C tr '\r' '\n' < Proteins.txt | sort -n
    

    seems to work for me, giving 1021 lines.
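
    The approach above can be tried on a small stand-in file first. The following is a sketch only: `sample.txt` and its contents are made up, with one raw 0xCE byte mimicking the problem line. Setting `LC_ALL=C` for `sort` as well is a cautious choice of mine and may not be necessary. The `iconv` round trip is one way to check whether a file is valid UTF-8 at all.

```shell
# Hypothetical sample: three CR-terminated records, one holding a raw 0xCE byte.
printf '61\tPTS\r1945\tH\316varstein\r7\tabc\r' > sample.txt

# In the C locale tr works on raw bytes, so the stray byte passes through untouched.
LC_ALL=C tr '\r' '\n' < sample.txt | LC_ALL=C sort -n > sorted.txt

# Count the lines that survived the conversion.
wc -l < sorted.txt

# iconv exits non-zero when the input is not valid UTF-8.
if iconv -f UTF-8 -t UTF-8 sample.txt > /dev/null 2>&1; then
    echo "valid UTF-8"
else
    echo "not valid UTF-8"
fi
```

    Once the real encoding is known, `iconv -f <that encoding> -t UTF-8` would convert the file; which encoding to name there is not something this sketch can tell you.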