Search code examples
csvawk

How to remove rows with similar data to keep only highest value in a specific column (tsv file) with awk in bash?


I have a very large .tsv file (80 GB) that I need to edit. It is made up of 5 columns. The last column represent a score. Some positions have multiple "Score" entries and I need to keep only the row for each position with the highest value.

For example, this position have multiple entries for each combination:

1   861265  C   A   0.071
1   861265  C   A   0.148
1   861265  C   G   0.001
1   861265  C   G   0.108
1   861265  C   T   0
1   861265  C   T   0.216
2   193456  G   A   0.006
2   193456  G   A   0.094
2   193456  G   C   0.011
2   193456  G   C   0.152
2   193456  G   T   0.003
2   193456  G   T   0.056

The desired output would look like this:

1   861265  C   A   0.148
1   861265  C   G   0.108
1   861265  C   T   0.216
2   193456  G   A   0.094
2   193456  G   C   0.152
2   193456  G   T   0.056

Doing it in python/pandas is not possible as the file is too large or takes too long. Therefore, I am looking for a solution using bash; in particular awk.

Thif input file has been sorted with the following command:

sort -t$'\t' -k1 -n -o sorted_file original_file

The command would basically need to:

  • compare the data from the first 4 columns in the sorted_file
  • if all of those are the same, then only the row with the highest value on column 5 should be printed onto the output file`. I am not very familiar with awk syntax. I have seen relatively similar questions in other forums, but I was unable to adapt it to my particular case. I have tried to adapt one of those solutions to my case like this:
awk -F, 'NR==1 {print; next} NR==2 {key=$2; next}$2 != key {print lastval; key = $2} {lastval = $0} END {print lastval}' sorted_files.tsv > filtered_file.tsv

However, the output file does not look like it should, at all. Any help would be very much appreciated.


Solution

  • You can try this approach. This also works on a non-sorted last column, only the first 4 columns have to be sorted.

    % awk 'NR>1&&str!=$1" "$2" "$3" "$4{print line; m=0}
           $5>=m{m=$5; line=$0}
           {str=$1" "$2" "$3" "$4} END{print line}' file
    1   861265  C   A   0.148
    1   861265  C   G   0.108
    1   861265  C   T   0.216
    2   193456  G   A   0.094
    2   193456  G   C   0.152
    2   193456  G   T   0.056
    

    Data

    % cat file
    1   861265  C   A   0.071
    1   861265  C   A   0.148
    1   861265  C   G   0.001
    1   861265  C   G   0.108
    1   861265  C   T   0
    1   861265  C   T   0.216
    2   193456  G   A   0.006
    2   193456  G   A   0.094
    2   193456  G   C   0.011
    2   193456  G   C   0.152
    2   193456  G   T   0.003
    2   193456  G   T   0.056