I am trying to modify a file that has 7 columns.
An example of the input file, 1.txt:
1 10 11 A L X3 -1.1
1 10 11 A L X1 1.1
1 13 21 A T X3 -2.1
3 11 12 A T X2 -3.1
3 11 12 K T X2 7.1
4 11 12 A T X7 -8.1
4 11 12 C T X7 -8.1
4 11 12 C T X7 11.1
I want to group the lines that share the first 5 columns but differ in the last two, and keep from each group only the line with the lowest value in the last column. Lines that do not share their first 5 columns with any other line should be kept as they are.
The expected output is:
1 10 11 A L X3 -1.1
1 13 21 A T X3 -2.1
3 11 12 A T X2 -3.1
3 11 12 K T X2 7.1
4 11 12 A T X7 -8.1
4 11 12 C T X7 -8.1
The 1st line is here because it shares the first 5 columns with the 2nd line in the 1.txt file and has the lowest number in the last column (-1.1 < 1.1), so we only keep it. The same goes for the last group: we keep the line with -8.1, as it is smaller than 11.1. We also keep the other lines, which do not have identical first 5 fields with each other.
What I have tried is using the first 5 columns as an index in awk, but it only prints the unique ones, not the rest. And it does not pick the line having the lowest number in the last column.
The code:
awk -F"\t" '!seen[$1,$2,$3,$4,$5]++' 1.txt
Its output:
1 10 11 A L X3 -1.1
1 10 11 A L X1 1.1
1 13 21 A T X3 -2.1
3 11 12 A T X2 -3.1
3 11 12 K T X2 7.1
4 11 12 A T X7 -8.1
4 11 12 C T X7 -8.1
4 11 12 C T X7 11.1
I cannot pick, among the lines that share the first 5 columns, the ones that have the lowest value in the last column.
Your help is appreciated!
awk '
{key = $1 FS $2 FS $3 FS $4 FS $5}
!(key in min) || $NF < min[key] {min[key] = $NF; line[key] = $0}
END {for (key in line) print line[key]}
' file
1 10 11 A L X3 -1.1
1 13 21 A T X3 -2.1
4 11 12 C T X7 -8.1
4 11 12 A T X7 -8.1
3 11 12 K T X2 7.1
3 11 12 A T X2 -3.1
Note the order of the output is indeterminate. You can always pipe the output to sort, or use GNU awk and control the array traversal.
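If the output should follow the order of the input instead, one portable option is to remember the order in which each key was first seen. The following is only a sketch: dedup_min is a hypothetical helper name, the fields are assumed to be whitespace-separated, and the last column is assumed to be numeric.

```shell
# Sketch: same min-per-key logic, printed in first-seen key order.
# dedup_min is a hypothetical name; reads the given files, or stdin if none.
dedup_min() {
  awk '
    {key = $1 FS $2 FS $3 FS $4 FS $5; val = $NF + 0}   # + 0 forces a numeric comparison
    !(key in min) {order[++n] = key}                    # remember when the key first appeared
    !(key in min) || val < min[key] {min[key] = val; line[key] = $0}
    END {for (i = 1; i <= n; i++) print line[order[i]]}
  ' "$@"
}
```

Calling dedup_min 1.txt on the sample input should give the six lines in the order shown in the question. With GNU awk, setting PROCINFO["sorted_in"] in a BEGIN block is the other way to get a defined traversal order for the for (key in line) loop.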
I just realized the line array is only needed when the 6th column has to appear in the output, and it will consume a lot of memory. The min array holds the first 5 fields as the key and the last field as the value, so if the first 5 fields plus the minimum are enough, the line array can be dropped:
awk '
{key = $1 FS $2 FS $3 FS $4 FS $5}
!(key in min) || $NF < min[key] {min[key] = $NF}
END {for (key in min) print key, min[key]}
' file
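If memory is the concern and the whole line (including the 6th column) is still wanted, another possibility is to let sort do the grouping, so that awk only has to remember the previous key. This is a sketch under assumptions: min_per_group_sorted is a hypothetical helper name, the fields are whitespace-separated, the 7th column is numeric, and the output comes out in sorted key order rather than input order.

```shell
# Sketch: sort brings equal keys together with the lowest 7th column first,
# then awk prints only the first line of each group.
# min_per_group_sorted is a hypothetical name; reads the given files, or stdin if none.
min_per_group_sorted() {
  LC_ALL=C sort -k1,1 -k2,2 -k3,3 -k4,4 -k5,5 -k7,7n "$@" |
  awk '
    {key = $1 FS $2 FS $3 FS $4 FS $5}
    key != prev {print}     # first line of a group carries the lowest value
    {prev = key}
  '
}
```

Here awk keeps a single key in memory, and the sorting is left to sort, which is designed for inputs larger than memory. Whether this is faster than the array version depends on the data.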
It might be taking so long due to swapping.