Search code examples
linuxloopsmax

Return the largest value of multiple variables of a column


I have a file containing a lot of rows of information about exons. Using Example.bed as the example:

chr10   20284306        20284479        1       174     +       5703    ENSGALE00010000132      1       TP53BP1 ENSGALP00010000035
chr10   20284549        20284700        175     326     +       5703    ENSGALE00010000133      2       TP53BP1 ENSGALP00010000035
chr10   20284930        20285000        327     397     +       5703    ENSGALE00010000134      3       TP53BP1 ENSGALP00010000035
chr10   20285794        20285949        398     553     +       5703    ENSGALE00010000135      4       TP53BP1 ENSGALP00010000035
chr10   20286413        20286494        554     635     +       5703    ENSGALE00010000136      5       TP53BP1 ENSGALP00010000035
chr10   20287104        20287249        636     781     +       5703    ENSGALE00010000137      6       TP53BP1 ENSGALP00010000035
chr10   20288800        20288920        782     902     +       5703    ENSGALE00010000138      7       TP53BP1 ENSGALP00010000035
chr10   20291082        20291173        903     994     +       5703    ENSGALE00010000139      8       TP53BP1 ENSGALP00010000035
chr10   20292507        20292733        995     1221    +       5703    ENSGALE00010000140      9       TP53BP1 ENSGALP00010000035
chr10   20293756        20294968        1222    2434    +       5703    ENSGALE00010000141      10      TP53BP1 ENSGALP00010000035
chr10   20297840        20297959        2435    2554    +       5703    ENSGALE00010000142      11      TP53BP1 ENSGALP00010000035
chr10   20298428        20298664        2555    2791    +       5703    ENSGALE00010000143      12      TP53BP1 ENSGALP00010000035
chr10   20299186        20299231        2792    2837    +       5703    ENSGALE00010000144      13      TP53BP1 ENSGALP00010000035
chr10   20300344        20300436        2838    2930    +       5703    ENSGALE00010000145      14      TP53BP1 ENSGALP00010000035
chr10   20301528        20302017        2931    3420    +       5703    ENSGALE00010000146      15      TP53BP1 ENSGALP00010000035
chr10   20302931        20303083        3421    3573    +       5703    ENSGALE00010000147      16      TP53BP1 ENSGALP00010000035
chr10   20304366        20304655        3574    3863    +       5703    ENSGALE00010000148      17      TP53BP1 ENSGALP00010000035
chr10   20304984        20305163        3864    4043    +       5703    ENSGALE00010000149      18      TP53BP1 ENSGALP00010000035
chr10   20305369        20305790        4044    4465    +       5703    ENSGALE00010000150      19      TP53BP1 ENSGALP00010000035
chr10   20305894        20306085        4466    4657    +       5703    ENSGALE00010000151      20      TP53BP1 ENSGALP00010000035
chr10   20307935        20308138        4658    4861    +       5703    ENSGALE00010000152      21      TP53BP1 ENSGALP00010000035
chr10   20308476        20308688        4862    5074    +       5703    ENSGALE00010000153      22      TP53BP1 ENSGALP00010000035
chr10   20309196        20309290        5075    5169    +       5703    ENSGALE00010000154      23      TP53BP1 ENSGALP00010000035
chr10   20312888        20313087        5170    5369    +       5703    ENSGALE00010000155      24      TP53BP1 ENSGALP00010000035
chr10   20314252        20314397        5370    5515    +       5703    ENSGALE00010000156      25      TP53BP1 ENSGALP00010000035
chr10   20315012        20315199        5516    5703    +       5703    ENSGALE00010000157      26      TP53BP1 ENSGALP00010000035
chr15   11105492        11105605        1       114     -       1785    ENSGALE00010001178      1       NF2     ENSGALP00010000157
chr15   11097329        11097454        115     240     -       1785    ENSGALE00010001179      2       NF2     ENSGALP00010000157
chr15   11096311        11096433        241     363     -       1785    ENSGALE00010001180      3       NF2     ENSGALP00010000157
chr15   11094116        11094199        364     447     -       1785    ENSGALE00010001181      4       NF2     ENSGALP00010000157
chr15   11089697        11089765        448     516     -       1785    ENSGALE00010001182      5       NF2     ENSGALP00010000157
chr15   11089149        11089231        517     599     -       1785    ENSGALE00010001183      6       NF2     ENSGALP00010000157
chr15   11088372        11088447        600     675     -       1785    ENSGALE00010001184      7       NF2     ENSGALP00010000157
chr15   11087935        11088069        676     810     -       1785    ENSGALE00010001185      8       NF2     ENSGALP00010000157
chr15   11085667        11085741        811     885     -       1785    ENSGALE00010001186      9       NF2     ENSGALP00010000157
chr15   11084092        11084205        886     999     -       1785    ENSGALE00010001187      10      NF2     ENSGALP00010000157
chr15   11081594        11081716        1000    1122    -       1785    ENSGALE00010001188      11      NF2     ENSGALP00010000157
chr15   11080996        11081213        1123    1340    -       1785    ENSGALE00010001189      12      NF2     ENSGALP00010000157
chr15   11079528        11079633        1341    1446    -       1785    ENSGALE00010001190      13      NF2     ENSGALP00010000157
chr15   11074882        11075009        1447    1574    -       1785    ENSGALE00010001191      14      NF2     ENSGALP00010000157
chr15   11072986        11073145        1575    1734    -       1785    ENSGALE00010001192      15      NF2     ENSGALP00010000157
chr15   11065433        11065483        1735    1785    -       1785    ENSGALE00010001197      16      NF2     ENSGALP00010000157
chr15   10834671        10834875        180     384     +       2373    ENSGALE00010001718      3       SF3A1   ENSGALP00010000228
chr15   10835706        10835963        385     642     +       2373    ENSGALE00010001722      4       SF3A1   ENSGALP00010000228
chr15   10836411        10836485        643     717     +       2373    ENSGALE00010001726      5       SF3A1   ENSGALP00010000228
chr15   10836871        10837021        718     868     +       2373    ENSGALE00010001729      6       SF3A1   ENSGALP00010000228
chr15   10837464        10837657        869     1062    +       2373    ENSGALE00010001733      7       SF3A1   ENSGALP00010000228
chr15   10838311        10838428        1063    1180    +       2373    ENSGALE00010001734      8       SF3A1   ENSGALP00010000228
chr15   10838545        10838730        1181    1366    +       2373    ENSGALE00010001735      9       SF3A1   ENSGALP00010000228
chr15   10839088        10839209        1367    1488    +       2373    ENSGALE00010001736      10      SF3A1   ENSGALP00010000228

I want to find out the largest rank of the exons, which is the maximum value of column 9 for each protein Id (column 11), using tab-delimited to separate columns. Desire output:

ENSGALP00010000035  26
ENSGALP00010000157  16
ENSGALP00010000228  10

I was thinking of using a loop to do it, however, is there a more direct way to achieve it? For example, using sort and uniq together?


Solution

  • One easy way, using GNU datamash (Disclaimer: I'm a contributor):

    $ datamash groupby 11 max 9 < Example.bed
    ENSGALP00010000035      26
    ENSGALP00010000157      16
    ENSGALP00010000228      10
    

    (This does require that the file already be sorted on the value used for grouping like in your example; if not, add the -s option)

    Or with awk:

    $ awk 'function max(a,b) { if (a > b) { return a } else { return b } }
           { totals[$11] = max(totals[$11], $9) }
           END { for (id in totals) printf "%s\t%s\n", id, totals[id] }' Example.bed
    ENSGALP00010000157      16
    ENSGALP00010000228      10
    ENSGALP00010000035      26
    

    (Output in random order; can be piped through sort if needed)