I have a file containing a lot of rows of information about exons. Using Example.bed
as the example:
chr10 20284306 20284479 1 174 + 5703 ENSGALE00010000132 1 TP53BP1 ENSGALP00010000035
chr10 20284549 20284700 175 326 + 5703 ENSGALE00010000133 2 TP53BP1 ENSGALP00010000035
chr10 20284930 20285000 327 397 + 5703 ENSGALE00010000134 3 TP53BP1 ENSGALP00010000035
chr10 20285794 20285949 398 553 + 5703 ENSGALE00010000135 4 TP53BP1 ENSGALP00010000035
chr10 20286413 20286494 554 635 + 5703 ENSGALE00010000136 5 TP53BP1 ENSGALP00010000035
chr10 20287104 20287249 636 781 + 5703 ENSGALE00010000137 6 TP53BP1 ENSGALP00010000035
chr10 20288800 20288920 782 902 + 5703 ENSGALE00010000138 7 TP53BP1 ENSGALP00010000035
chr10 20291082 20291173 903 994 + 5703 ENSGALE00010000139 8 TP53BP1 ENSGALP00010000035
chr10 20292507 20292733 995 1221 + 5703 ENSGALE00010000140 9 TP53BP1 ENSGALP00010000035
chr10 20293756 20294968 1222 2434 + 5703 ENSGALE00010000141 10 TP53BP1 ENSGALP00010000035
chr10 20297840 20297959 2435 2554 + 5703 ENSGALE00010000142 11 TP53BP1 ENSGALP00010000035
chr10 20298428 20298664 2555 2791 + 5703 ENSGALE00010000143 12 TP53BP1 ENSGALP00010000035
chr10 20299186 20299231 2792 2837 + 5703 ENSGALE00010000144 13 TP53BP1 ENSGALP00010000035
chr10 20300344 20300436 2838 2930 + 5703 ENSGALE00010000145 14 TP53BP1 ENSGALP00010000035
chr10 20301528 20302017 2931 3420 + 5703 ENSGALE00010000146 15 TP53BP1 ENSGALP00010000035
chr10 20302931 20303083 3421 3573 + 5703 ENSGALE00010000147 16 TP53BP1 ENSGALP00010000035
chr10 20304366 20304655 3574 3863 + 5703 ENSGALE00010000148 17 TP53BP1 ENSGALP00010000035
chr10 20304984 20305163 3864 4043 + 5703 ENSGALE00010000149 18 TP53BP1 ENSGALP00010000035
chr10 20305369 20305790 4044 4465 + 5703 ENSGALE00010000150 19 TP53BP1 ENSGALP00010000035
chr10 20305894 20306085 4466 4657 + 5703 ENSGALE00010000151 20 TP53BP1 ENSGALP00010000035
chr10 20307935 20308138 4658 4861 + 5703 ENSGALE00010000152 21 TP53BP1 ENSGALP00010000035
chr10 20308476 20308688 4862 5074 + 5703 ENSGALE00010000153 22 TP53BP1 ENSGALP00010000035
chr10 20309196 20309290 5075 5169 + 5703 ENSGALE00010000154 23 TP53BP1 ENSGALP00010000035
chr10 20312888 20313087 5170 5369 + 5703 ENSGALE00010000155 24 TP53BP1 ENSGALP00010000035
chr10 20314252 20314397 5370 5515 + 5703 ENSGALE00010000156 25 TP53BP1 ENSGALP00010000035
chr10 20315012 20315199 5516 5703 + 5703 ENSGALE00010000157 26 TP53BP1 ENSGALP00010000035
chr15 11105492 11105605 1 114 - 1785 ENSGALE00010001178 1 NF2 ENSGALP00010000157
chr15 11097329 11097454 115 240 - 1785 ENSGALE00010001179 2 NF2 ENSGALP00010000157
chr15 11096311 11096433 241 363 - 1785 ENSGALE00010001180 3 NF2 ENSGALP00010000157
chr15 11094116 11094199 364 447 - 1785 ENSGALE00010001181 4 NF2 ENSGALP00010000157
chr15 11089697 11089765 448 516 - 1785 ENSGALE00010001182 5 NF2 ENSGALP00010000157
chr15 11089149 11089231 517 599 - 1785 ENSGALE00010001183 6 NF2 ENSGALP00010000157
chr15 11088372 11088447 600 675 - 1785 ENSGALE00010001184 7 NF2 ENSGALP00010000157
chr15 11087935 11088069 676 810 - 1785 ENSGALE00010001185 8 NF2 ENSGALP00010000157
chr15 11085667 11085741 811 885 - 1785 ENSGALE00010001186 9 NF2 ENSGALP00010000157
chr15 11084092 11084205 886 999 - 1785 ENSGALE00010001187 10 NF2 ENSGALP00010000157
chr15 11081594 11081716 1000 1122 - 1785 ENSGALE00010001188 11 NF2 ENSGALP00010000157
chr15 11080996 11081213 1123 1340 - 1785 ENSGALE00010001189 12 NF2 ENSGALP00010000157
chr15 11079528 11079633 1341 1446 - 1785 ENSGALE00010001190 13 NF2 ENSGALP00010000157
chr15 11074882 11075009 1447 1574 - 1785 ENSGALE00010001191 14 NF2 ENSGALP00010000157
chr15 11072986 11073145 1575 1734 - 1785 ENSGALE00010001192 15 NF2 ENSGALP00010000157
chr15 11065433 11065483 1735 1785 - 1785 ENSGALE00010001197 16 NF2 ENSGALP00010000157
chr15 10834671 10834875 180 384 + 2373 ENSGALE00010001718 3 SF3A1 ENSGALP00010000228
chr15 10835706 10835963 385 642 + 2373 ENSGALE00010001722 4 SF3A1 ENSGALP00010000228
chr15 10836411 10836485 643 717 + 2373 ENSGALE00010001726 5 SF3A1 ENSGALP00010000228
chr15 10836871 10837021 718 868 + 2373 ENSGALE00010001729 6 SF3A1 ENSGALP00010000228
chr15 10837464 10837657 869 1062 + 2373 ENSGALE00010001733 7 SF3A1 ENSGALP00010000228
chr15 10838311 10838428 1063 1180 + 2373 ENSGALE00010001734 8 SF3A1 ENSGALP00010000228
chr15 10838545 10838730 1181 1366 + 2373 ENSGALE00010001735 9 SF3A1 ENSGALP00010000228
chr15 10839088 10839209 1367 1488 + 2373 ENSGALE00010001736 10 SF3A1 ENSGALP00010000228
I want to find out the largest rank of the exons, which is the maximum value of column 9 for each protein Id (column 11), using tab-delimited to separate columns. Desire output:
ENSGALP00010000035 26
ENSGALP00010000157 16
ENSGALP00010000228 10
I was thinking of using a loop to do it, however, is there a more direct way to achieve it? For example, using sort
and uniq
together?
One easy way, using GNU datamash
(Disclaimer: I'm a contributor):
$ datamash groupby 11 max 9 < Example.bed
ENSGALP00010000035 26
ENSGALP00010000157 16
ENSGALP00010000228 10
(This does require that the file already be sorted on the value used for grouping like in your example; if not, add the -s
option)
Or with awk
:
$ awk 'function max(a,b) { if (a > b) { return a } else { return b } }
{ totals[$11] = max(totals[$11], $9) }
END { for (id in totals) printf "%s\t%s\n", id, totals[id] }' Example.bed
ENSGALP00010000157 16
ENSGALP00010000228 10
ENSGALP00010000035 26
(Output in random order; can be piped through sort
if needed)