I'm trying to sort a file of almost nine hundred lines numerically. For others, the commands
tr '\r' '\n' < myfile.txt | sort
or
tr '\r' '\n' < myfile.txt | sort -n
seem to do the trick, but I don't get the output I want (only two hundred lines). I can see that all duplicate numbers are lost on my Mac, and I get the terminal error "tr: illegal byte sequence".
What am I doing wrong, and how can I save the result to a file? Could it have something to do with the file having blank columns?
The file is here: dropbox.com/s/umzx64c5ix90l3y/Proteins.txt?dl=0
EDIT/CLARIFICATION:
Once I've sorted all the lines numerically, I need to combine the lines that share the same number so that new information is added to the upper line. Take for instance the lines with number 61:
Col 1 2 3 4 5 6 7 8 9 10 11
61 PTS... cyt 1bl.. 0,38 MONOMER homo-trimer FRUC... PER...Bac..
61 PTS... 3
becomes:
Col 1 2 3 4 5 6 7 8 9 10 11
61 PTS... cyt 1bl.. 0,38 MONOMER homo-trimer FRUC... PER...Bac.. 3
If both lines have information in the same column, I need the information from the upper line to be kept.
Thanks :)
Your file is not properly UTF-8-encoded, while your locale is most certainly set to UTF-8. Line 195 contains (invalid sequences are marked with <HEX>):
1945 comM protection against fracitins/bacteriocins (found by comparison to spr genome according to H<CE>varstein 2006) integral membrane protein (H<CE>varstein) no model
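A rough way to locate such bytes yourself, and to try a conversion once you know the real encoding, might look like the sketch below. The file names `sample.txt` and `sample.utf8.txt` are hypothetical, and ISO-8859-1 is only a placeholder source encoding: the actual encoding of Proteins.txt has not been established here.

```shell
# Hypothetical sample: the byte 0xCE followed by "v" is not valid UTF-8.
printf 'H\316varstein 2006\nplain ascii line\n' > sample.txt

# iconv exits non-zero if the input is not valid in the declared source encoding.
if iconv -f UTF-8 -t UTF-8 sample.txt > /dev/null 2>&1; then
    echo "valid UTF-8"
else
    echo "not valid UTF-8"
fi

# Show the line numbers of lines containing bytes outside printable ASCII.
# -a forces grep to treat the file as text; LC_ALL=C makes it work byte by byte.
LC_ALL=C grep -an '[^[:print:][:space:]]' sample.txt

# ASSUMPTION: ISO-8859-1 stands in for whatever the real encoding turns out to be.
iconv -f ISO-8859-1 -t UTF-8 sample.txt > sample.utf8.txt
```

If the converted text shows the wrong characters, the source encoding guess was wrong; try another `-f` value rather than trusting the first one that runs without error.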
Figure out what the encoding is, and then either convert the file to the proper encoding or change the locale to accommodate it. Simply trying
env LC_ALL=C tr '\r' '\n' < Proteins.txt | sort -n
seems to work for me, giving 1021 lines.
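To save the result, redirect the output to a file. For the follow-up in the question (merging rows that share a number while keeping the upper row's value where both have one), here is a minimal awk sketch. It assumes tab-separated columns with the number in column 1, which I have not verified against the real file; `sample_cr.txt`, `sorted.txt` and `merged.txt` are hypothetical names.

```shell
# Hypothetical sample: CR line endings, tab-separated, two rows for number 61.
printf '61\tPTS\tcyt\t\r7\tabc\t\t\r61\tPTS\t\t3\r' > sample_cr.txt

# Convert CR to LF, sort numerically and save the result.
# -s keeps rows with equal numbers in their original order (upper row first).
env LC_ALL=C tr '\r' '\n' < sample_cr.txt | env LC_ALL=C sort -s -n > sorted.txt

# Merge consecutive rows with the same first column; a cell is only filled
# if it is still empty, so the upper row wins when both rows have a value.
awk -F '\t' -v OFS='\t' '
function emit(   i, line) {
    if (!have) return
    line = cur[1]
    for (i = 2; i <= n; i++) line = line OFS cur[i]
    print line
}
NF == 0 { next }                 # skip blank lines
!have || $1 != key {             # new number: print the previous merged row
    emit()
    key = $1; have = 1; n = 0
    split("", cur)               # clear the stored row
}
{
    if (NF > n) n = NF
    for (i = 1; i <= NF; i++)
        if (cur[i] == "") cur[i] = $i
}
END { emit() }
' sorted.txt > merged.txt
```

With the sample above, the two rows for 61 collapse into a single row whose fourth column is 3, and the row for 7 is left unchanged.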