I have a huge tab-separated table like the one below: the first row lists the subjects (samples), and the remaining rows hold my counts.
KEGGAnnotation a b c d e f g h i l m n o p q r s t u v z w ee wr ty yu im
K01824 0 0 1 5 0 0 0 0 0 0 0 0 0 0 14 6 0 0 0 0 0 0 0 0 0 0 0
K03924 17302 15372 19601 18732 17180 18094 23560 20516 14280 24187 19642 20521 20330 20843 22948 17124 19557 18319 16608 19463 18334 21022 14325 10819 13342 16876 16979
K13730 0 0 1 5 0 0 0 0 0 0 0 0 0 0 14 6 0 0 0 0 0 0 0 0 0 0 0
K13735 5360 463 12516 7235 5051 2022 2499 2778 5392 1220 6460 9490 1169 6556 14862 9657 7360 6837 7810 4368 2186 12474 7810 9755 1401 12867 4431
K07279 0 0 1 5 0 0 0 0 0 0 0 0 0 0 14 6 0 0 0 0 0 0 0 0 0 0 0
K14194 4499 2216 2322 2031 2763 2219 704 1647 2536 876 2692 4196 687 2958 3207 2153 2266 1974 370 2867 1110 5372 3637 9828 2038 2812 3472
K11494 0 0 1 10 0 0 0 0 11 0 0 0 0 0 14 6 0 0 0 0 0 0 0 0 0 0 0
K03332 0 0 1 5 0 0 0 0 0 0 0 0 0 0 14 6 0 0 0 0 0 0 0 0 0 0 0
K01317 3 1 6 0 1 3 0 14 11 0 21 8 0 20 0 263 0 0 6 3 5 0 0 41 0 0 2
I would like to extract only the lines in which counts >100 are present in at least 20% of the samples (with the 27 samples here, that means in at least 6 samples).
E.g., row K03924 would be kept, but not K03332.
Increment a counter for each value greater than the threshold, and print the line if the counter reaches at least 20% of the fields checked. This will also print the header line, because the non-numeric sample names are compared as strings and sort after "100".
awk '{c=0; for(i=2;i<=NF;i++) c+=($i>100); if(c>=0.2*(NF-1)) print $0}' input
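If you would rather not rely on string comparison to let the header through, a slightly more explicit variant could look like the sketch below. It prints the first line unconditionally and takes the threshold and the fraction as variables. The variable names min and frac, the temporary file, and its toy contents are made up for illustration, and -F'\t' assumes the file really is tab-separated.

```shell
# Toy input (hypothetical data): 5 samples, so 20% means at least 1 sample.
tmp=$(mktemp)
printf 'ID\ts1\ts2\ts3\ts4\ts5\n'    >  "$tmp"
printf 'K1\t0\t0\t1\t5\t0\n'         >> "$tmp"
printf 'K2\t150\t200\t0\t0\t0\n'     >> "$tmp"
printf 'K3\t100\t100\t100\t100\t100\n' >> "$tmp"

result=$(awk -F'\t' -v min=100 -v frac=0.2 '
  NR == 1 { print; next }              # always keep the header line
  {
    c = 0
    for (i = 2; i <= NF; i++)          # skip the ID column
      c += ($i + 0 > min)              # +0 forces a numeric comparison
    if (c >= frac * (NF - 1)) print    # at least frac of the samples
  }' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

On this toy input only the header and the K2 row are printed: K1 has no value above 100, and K3 is dropped because the comparison is strict (100 is not >100).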