I'm processing a TSV file with a hundred columns and I would like to pick some of them based on their last values (ie. the value on the last row).
For instance with the following data (and I let you imagine 96 more columns!):
Year An Ti Gi Na
2000 5 10 2 3
2010 3 2 5 7
2020 6 3 5 6
I would like to select all columns whose value on the last row is at least 5 (and, better, I would like to select all columns whose value on the last row is at least the value of the row Gi).
So in the end, I would like the following output (the column Ti is discarded because its value on the last row is 3 which is below our threshold).
Year An Gi Na
2000 5 2 3
2010 3 5 7
2020 6 5 6
Do you have any ideas?
Thanks!
You could write this bash script
#!/bin/bash
gi_value=$(mlr --t2n tail -n 1 then cut -f Gi input.tsv)
gi_columns=$(mlr --t2n tail -n 1 then label 1 then reshape -r '^[a-zA-Z]' -o k,v then filter '$v >= '"$gi_value"'' then cut -f k then nest --ivar "," -f k input.tsv)
mlr --tsv cut -f Year,"$gi_columns" input.tsv
First extract the Gi value
mlr --t2n tail -n 1 then cut -f Gi input.tsv
to populate the gi_value
var with 5
value.
Then extract the list of columns of the last row in wich the value is >= 5
mlr --t2n tail -n 1 \
then label 1 \
then reshape -r '^[a-zA-Z]' -o k,v \
then filter '$v >= '"$gi_value"'' \
then cut -f k \
then nest --ivar "," -f k input.tsv
This gives you in output An,Gi,Na
.
The last step is to cut An,Gi,Na
columns:
mlr --tsv cut -f Year,"$gi_columns" input.tsv
The output is
Year An Gi Na
2000 5 2 3
2010 3 5 7
2020 6 5 6