Sorting on a column alphanumerically


I have the following file and want to sort it alphanumerically on the 6th column so that, within each ID (the part before the ':'), E1 is followed by I1, then E2, and so on. When I run sort -V -k6 file, all the ID:I entries end up at the end rather than where they belong. When I run sort -k6, the Es and Is of each ID are kept together, but some IDs belonging to different series are interspersed (I have highlighted them here). How can I sort the file so that no two IDs are mixed and the column is in the intended order?

chr1    259017  259121  104 -   ENSG00000228463:E2
chr1    259122  267095  7973    -   ENSG00000228463:I1
chr1    267096  267253  157 -   ENSG00000228463:E1
chr1    317720  317781  61  +   ENSG00000237094:E1
chr1    317782  320161  2379    +   ENSG00000237094:I1
chr1    320162  320653  491 +   ENSG00000237094:E2
chr1    320654  320880  226 +   ENSG00000237094:I2
chr1    320881  320938  57  +   ENSG00000237094:E3
chr1    320939  321031  92  +   ENSG00000237094:I3
chr1    321032  321290  258 +   ENSG00000237094:E4
chr1    321291  322037  746 +   ENSG00000237094:I4
chr1    322038  322228  190 +   ENSG00000237094:E5
chr1    322229  322671  442 +   ENSG00000237094:I5
chr1    322672  323073  401 +   ENSG00000237094:E6
chr1    323074  323860  786 +   ENSG00000237094:I6
chr1    323861  324060  199 +   ENSG00000237094:E7
chr1    324061  324287  226 +   ENSG00000237094:I7
chr1    324288  324345  57  +   ENSG00000237094:E8
chr1    324346  324438  92  +   ENSG00000237094:I8
chr1    324439  326514  2075    +   ENSG00000237094:E9
**chr1  326096  326569  473 +   ENSG00000250575:E1**
chr1    326515  327551  1036    +   ENSG00000237094:I9
**chr1  326570  327347  777 +   ENSG00000250575:I1**
**chr1  327348  328112  764 +   ENSG00000250575:E2**
chr1    327552  328453  901 +   ENSG00000237094:E10
chr1    328454  329783  1329    +   ENSG00000237094:I10
**chr1  329431  329620  189 -   ENSG00000233653:E2**
**chr1  329621  329949  328 -   ENSG00000233653:I1**
chr1    329784  329976  192 +   ENSG00000237094:E11

Solution

  • Original answer:

    sed 's/:[EI]/&_ /' foo.txt |  #separate the number at the end with a space
    sort -k6 | sort -n -k7 |         #sort by code, then by [EI] number
    sed 's/_ //'                  #remove the underscore space
    

    I like to do things like this by 'protecting' strings with a placeholder to isolate what I'm interested in, then replacing them later.

    Closer:

    sed 's/:[EI]/_ &_ /' foo.txt | sort -n -k8 | sort -k6,6 | sed 's/_ //g'
    

    But this naively assumes that the second sort keeps the order the first one produced within each ID, which sort does not guarantee by default... so sometimes E2 will come before E1...

    I'm not sure it can be done with sort alone; awk might be the way to go...
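
Following up on the awk idea, here is a minimal sketch of a decorate, sort, undecorate pipeline. It is not the answerer's method. It assumes the columns are whitespace-separated, that the 6th column always has the form ID:E<n> or ID:I<n>, and that ordering the IDs lexicographically is acceptable. The function name sort_features is hypothetical.

```shell
# Hypothetical helper: sort a file by ID, then by feature number, then E before I.
sort_features() {
    tab=$(printf '\t')
    awk '{
        split($6, part, ":")             # part[1] = ID, part[2] = E<n> or I<n>
        type = substr(part[2], 1, 1)     # "E" or "I"
        num  = substr(part[2], 2)        # the trailing number
        # Prepend the three sort keys, tab-separated, in front of the original line
        print part[1] "\t" num "\t" type "\t" $0
    }' "$1" |
    LC_ALL=C sort -t "$tab" -k1,1 -k2,2n -k3,3 |   # one sort call, three keys
    cut -f4-                                        # drop the keys again
}
```

It could be called as sort_features foo.txt. Passing all three keys to a single sort call avoids relying on a second sort to preserve the order produced by the first.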