Search code examples
miller

Miller: selecting columns based on their last value


I'm processing a TSV file with a hundred columns and I would like to pick some of them based on their last values (ie. the value on the last row).

For instance with the following data (and I let you imagine 96 more columns!):

Year    An  Ti  Gi  Na
2000    5   10  2   3
2010    3   2   5   7
2020    6   3   5   6

I would like to select all columns whose value on the last row is at least 5 (and, better, I would like to select all columns whose value on the last row is at least the value of the row Gi).

So in the end, I would like the following output (the column Ti is discarded because its value on the last row is 3 which is below our threshold).

Year    An  Gi  Na
2000    5   2   3
2010    3   5   7
2020    6   5   6

Do you have any ideas?

Thanks!


Solution

  • You could write this bash script

    #!/bin/bash
    
    gi_value=$(mlr --t2n tail -n 1 then cut -f Gi input.tsv)
    
    gi_columns=$(mlr --t2n tail -n 1 then label 1 then reshape -r '^[a-zA-Z]' -o k,v then filter '$v >= '"$gi_value"'' then cut -f k then nest --ivar "," -f k input.tsv)
    
    mlr --tsv cut -f Year,"$gi_columns" input.tsv
    

    First extract the Gi value

    mlr --t2n tail -n 1 then cut -f Gi input.tsv
    

    to populate the gi_value var with 5 value.

    Then extract the list of columns of the last row in wich the value is >= 5

    mlr --t2n tail -n 1 \
    then label 1 \
    then reshape -r '^[a-zA-Z]' -o k,v \
    then filter '$v >= '"$gi_value"'' \
    then cut -f k \
    then nest --ivar "," -f k input.tsv
    

    This gives you in output An,Gi,Na.

    The last step is to cut An,Gi,Na columns:

    mlr --tsv cut -f Year,"$gi_columns" input.tsv
    

    The output is

    Year    An      Gi      Na
    2000    5       2       3
    2010    3       5       7
    2020    6       5       6