linux sorting awk unique

Picking the lines sharing certain columns (but not all)


I am trying to filter a file that has 7 columns. An example input file is:

1.txt
    1   10  11  A   L   X3  -1.1
    1   10  11  A   L   X1   1.1
    1   13  21  A   T   X3  -2.1
    3   11  12  A   T   X2  -3.1
    3   11  12  K   T   X2   7.1
    4   11  12  A   T   X7  -8.1
    4   11  12  C   T   X7  -8.1
    4   11  12  C   T   X7  11.1

I want to group the lines that share the first 5 columns but differ in the last two, and from each such group keep only the line with the lowest value in the last column. Lines that do not share their first 5 columns with any other line should be kept as they are.

The expected output is:

    1   10  11  A   L   X3  -1.1
    1   13  21  A   T   X3  -2.1
    3   11  12  A   T   X2  -3.1
    3   11  12  K   T   X2   7.1
    4   11  12  A   T   X7  -8.1
    4   11  12  C   T   X7  -8.1

The 1st line is kept because it shares the first 5 columns with the 2nd line of 1.txt and has the lower value in the last column (-1.1 < 1.1). Likewise, of the last two lines we keep the one with -8.1, as it is smaller than 11.1. We also keep the other lines, whose first 5 fields are not identical to those of any other line.

What I have tried is using the first 5 columns as the array index in awk, but it only prints the unique ones, not the rest, and it does not pick the line with the lowest number in the last column. The code:

awk -F"\t" '!seen[$1,$2,$3,$4,$5]++' 1.txt 

Its output:

1   10  11  A   L   X3  -1.1
1   10  11  A   L   X1   1.1
1   13  21  A   T   X3  -2.1
3   11  12  A   T   X2  -3.1
3   11  12  K   T   X2   7.1
4   11  12  A   T   X7  -8.1
4   11  12  C   T   X7  -8.1
4   11  12  C   T   X7  11.1

Among the lines that share only the first 5 columns, I cannot pick the one with the lowest value in the last column. Your help is appreciated!


Solution

    awk '
        {key = $1 FS $2 FS $3 FS $4 FS $5} 
        !(key in min) || $NF < min[key] {min[key] = $NF; line[key] = $0} 
        END {for (key in line) print line[key]}
    ' file
    
        1   10  11  A   L   X3  -1.1
        1   13  21  A   T   X3  -2.1
        4   11  12  C   T   X7  -8.1
        4   11  12  A   T   X7  -8.1
        3   11  12  K   T   X2   7.1
        3   11  12  A   T   X2  -3.1
    

    Note the order of the output is indeterminate. You can always pipe the output to sort, or use GNU awk and control the array traversal.
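As a minimal sketch of the "pipe the output to sort" option, the same awk program can be followed by a `sort` that orders columns 1 to 3 numerically and column 4 as text. The temporary file, the `input` and `result` variable names, and the choice of sort keys are assumptions made for illustration; the data is the sample from the question.

```shell
# Sample input copied from the question; the temp file exists only for this sketch.
input=$(mktemp)
cat > "$input" <<'EOF'
1   10  11  A   L   X3  -1.1
1   10  11  A   L   X1   1.1
1   13  21  A   T   X3  -2.1
3   11  12  A   T   X2  -3.1
3   11  12  K   T   X2   7.1
4   11  12  A   T   X7  -8.1
4   11  12  C   T   X7  -8.1
4   11  12  C   T   X7  11.1
EOF

# Same awk program as in the answer; sort then fixes the output order.
# -k1,1n -k2,2n -k3,3n : columns 1-3 compared as numbers
# -k4b,4               : column 4 compared as text, ignoring leading blanks
result=$(
    awk '
        {key = $1 FS $2 FS $3 FS $4 FS $5}
        !(key in min) || $NF < min[key] {min[key] = $NF; line[key] = $0}
        END {for (key in line) print line[key]}
    ' "$input" | LC_ALL=C sort -k1,1n -k2,2n -k3,3n -k4b,4
)

printf '%s\n' "$result"
rm -f "$input"
```

On the sample data this should print the six kept lines in the same order as the expected output in the question. `LC_ALL=C` is set only to make the text comparison on column 4 independent of the locale.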


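    I just realized the line array is not needed if you only want the key columns and the minimum, and on a large file it will consume a lot of memory. The min array holds the first 5 fields as the key and the last field ($NF) as the value. Note that this version prints only the 5 key fields and the minimum, so the 6th column is dropped; keep the first version if you need the whole line.

    awk '
        {key = $1 FS $2 FS $3 FS $4 FS $5} 
        !(key in min) || $NF < min[key] {min[key] = $NF} 
        END {for (key in min) print key, min[key]}   # loop over min; the line array no longer exists
    ' file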
    

    If it is taking a long time on a large file, that might be due to swapping.
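If memory really is the bottleneck, one alternative that is not part of the answer above is to sort first and let awk remember only the previous key. The sketch below assumes a POSIX `sort` is available and that the file is whitespace-separated; the temporary file and the `input`, `result`, and `prev` names are illustrative only. Sorting by the 5 key columns and then numerically by column 7 puts the lowest value first within each group, so awk only needs to print the first line of each group.

```shell
# Sample input copied from the question; the temp file exists only for this sketch.
input=$(mktemp)
cat > "$input" <<'EOF'
1   10  11  A   L   X3  -1.1
1   10  11  A   L   X1   1.1
1   13  21  A   T   X3  -2.1
3   11  12  A   T   X2  -3.1
3   11  12  K   T   X2   7.1
4   11  12  A   T   X7  -8.1
4   11  12  C   T   X7  -8.1
4   11  12  C   T   X7  11.1
EOF

# Sort so that lines with the same first 5 columns are adjacent and,
# within each group, the smallest value in column 7 comes first.
# awk then keeps no arrays at all: it prints a line only when the key changes.
result=$(
    LC_ALL=C sort -k1,1n -k2,2n -k3,3n -k4b,4 -k5b,5 -k7,7n "$input" |
    awk '
        {key = $1 FS $2 FS $3 FS $4 FS $5}
        key != prev {print; prev = key}
    '
)

printf '%s\n' "$result"
rm -f "$input"
```

The trade-off is that `sort` does the heavy lifting (it can spill to temporary files on disk) while awk uses constant memory, and the output comes out ordered by the key columns rather than in input order. On the sample data it should print the same six lines as the expected output in the question.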