
Finding Rows with Equal Values, Comparing their Columns and Removing the Larger Value's Line


I have the following line of code:

grep -nP ';MULTIALLELIC' biallelic.output | sort -k2 | awk -F'[:;\t]' '{print $1,$3,$9,$13}'

It outputs:

2374 213 MID=212 GO=1
2462 213 MID=477 GO=137
2394 233 MID=232 GO=1
2464 233 MID=668 GO=1070
2185 24 MID=23 GO=1
2465 24 MID=752 GO=1083
2146 48 MID=354 GO=1010
1893 48 MID=47 GO=1
2219 58 MID=57 GO=1
2463 58 MID=595 GO=1057

For rows that share the same value in the second column, I need to compare their GO values. Whichever row has the larger GO value, I would like to remove that line (its line number is in the first column) from the original file.


By adding awk '{print>$2}' I am able to separate the lines based on the value in column two but I am trying to avoid writing results to files.

What am I missing?

Edit: I am actually trying to remove those lines from biallelic.output, not just print what lines I want to remove. Sorry for the confusion.


Solution

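  • This compares the GO values of rows that share the same second column and lists the line number of every row whose GO value is higher than the minimum for its group (here file stands for the output shown in the question).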

    $ sed 's/GO=/& /' file | 
      sort -k2,2 -k5n      | 
      awk 'a[$2]++{if(!h) print h="Lines Removed From biallelic.output";
                   print $1}'
    
    Lines Removed From biallelic.output
    2462
    2464
    2465
    2146
    2463
    

    The header is printed only once, and only if there is at least one line to report.

    sed inserts a space after GO= so that the number becomes a field of its own (the fifth) and can be sorted on. sort groups the rows by the second field and orders each group numerically by GO value. The first row of each group is therefore the minimum, and awk reports every row except the first in each group.
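
    The same idea can be sketched in a single awk pass, without sed or sort, by remembering the smallest GO value seen so far for each value of the second column. This is an illustrative alternative rather than the pipeline above: the rows and to_remove variables are made up for the example, the sample rows are copied from the question, and on a tie it assumes the later row is the one to report. The line numbers come out in input order rather than sorted order.

    ```shell
    # Sample rows copied from the question (line number, position, MID, GO).
    rows='2374 213 MID=212 GO=1
    2462 213 MID=477 GO=137
    2146 48 MID=354 GO=1010
    1893 48 MID=47 GO=1'

    # For each value of column 2, remember the row with the smallest GO so far.
    # Whenever a row loses the comparison, print its line number (column 1).
    to_remove=$(printf '%s\n' "$rows" | awk '
    {
        go = $4; sub(/^GO=/, "", go); go += 0     # numeric part of GO=...
        if (!($2 in min)) { min[$2] = go; keep[$2] = $1; next }
        if (go < min[$2]) { print keep[$2]; min[$2] = go; keep[$2] = $1 }
        else              { print $1 }
    }')

    printf '%s\n' "$to_remove"
    ```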

    To get the filtered output instead (the rows that are kept):

    $ sed 's/GO=/& /' file | 
      sort -k2,2 -k5n      | 
      awk '!a[$2]++ {sub(/GO= /,"GO="); print}'
    
    2374 213 MID=212 GO=1
    2394 233 MID=232 GO=1
    2185 24 MID=23 GO=1
    1893 48 MID=47 GO=1
    2219 58 MID=57 GO=1
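
    The question's edit asks for the lines to be deleted from biallelic.output, not just listed. One possible sketch follows, assuming the line numbers to drop have been saved one per line without the header. The name remove.txt is hypothetical, and the five-line biallelic.output below is a stand-in created only for the demonstration.

    ```shell
    # Stand-in for biallelic.output: five numbered lines in a scratch directory.
    tmp=$(mktemp -d)
    printf 'line %s\n' 1 2 3 4 5 > "$tmp/biallelic.output"

    # Hypothetical list of line numbers to drop, one per line, no header.
    printf '%s\n' 2 4 > "$tmp/remove.txt"

    # Load the line numbers into an array, then print only the lines of the
    # data file whose line number (NR) is not in that array.
    awk -v list="$tmp/remove.txt" '
    BEGIN { while ((getline n < list) > 0) drop[n + 0] }
    !(NR in drop)
    ' "$tmp/biallelic.output" > "$tmp/filtered.output"

    cat "$tmp/filtered.output"    # leaves lines 1, 3 and 5
    ```

    Writing to a separate file and renaming it over the original afterwards avoids reading and overwriting the same file at once.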