Awk $1 vs cut -f 1 - why am I getting different answers?


This question has been bugging me, and I have a million other projects to get to, so I was hoping to clear it up. So far I haven't been able to find an answer, even though it seems pretty simple. I used:

awk '$1' merged_counts.txt |sort|uniq -d|wc

and got 216 lines. However, that number is incorrect. If I use

more merged_counts.txt|cut -f 1|sort|uniq -d|wc

I get 271 lines, which is correct. If I use

awk '{print $1}' merged_counts.txt |sort|uniq -d|wc

I also get 271 lines, but then I've lost the rest of the fields. I can't figure out why it behaves this way for what seems to be an elementary thing. Surely I must be overlooking something. Thanks for any help or suggestions.

Example of file:

B3GALT1 72  128 65  124 87  118 102 117 38  106 87  115 27  20  89  30
AMY1A   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0
PSENEN  654 459 648 462 508 399 537 532 696 460 625 473 621 322 633 434

The gene 'AMY1A' is one of those genes annotated on both DNA strands, so it appears twice in my file.


Solution

  • I see in a comment you say "I need to keep the entire line, but I need to filter for duplicates based only on the first field", so let's start with that. Let's further assume that your fields are separated by any whitespace and that you always want to print the first line when a duplicate occurs.

    The awk command you'd use then would be:

    awk '!seen[$1]++' file
    

    Now update your question with a description, sample input, and expected output to tell us what else you need.
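
As for why the counts in the question differ, here is a minimal sketch on made-up data (the sample rows and counts are hypothetical, not taken from merged_counts.txt). In awk, a bare '$1' is a pattern with no action, so it prints the whole line whenever the first field is non-empty and not numerically zero; uniq -d then only reports lines that are identical in every field. With '{print $1}' or cut -f 1, only the first field reaches sort and uniq, so duplicates are counted by gene name alone. Note also that cut -f splits on tabs by default while awk splits on any run of whitespace, which could matter for other files.

```shell
#!/bin/sh
# Hypothetical sample: AMY1A appears twice with different counts,
# as it might for a gene annotated on both strands.
tmp=$(mktemp)
printf 'B3GALT1\t72\t128\nAMY1A\t0\t0\nPSENEN\t654\t459\nAMY1A\t1\t0\n' > "$tmp"

# Pattern only, no action: prints the WHOLE line when $1 is truthy,
# so uniq -d compares complete lines.
whole_line_dups=$(awk '$1' "$tmp" | sort | uniq -d | wc -l | tr -d ' ')

# Action only: prints just the first field,
# so uniq -d compares gene names.
first_field_dups=$(awk '{print $1}' "$tmp" | sort | uniq -d | wc -l | tr -d ' ')

# Keep the first full line seen for each distinct value of $1.
deduped=$(awk '!seen[$1]++' "$tmp")

echo "whole-line duplicates:  $whole_line_dups"
echo "first-field duplicates: $first_field_dups"
printf '%s\n' "$deduped"

rm -f "$tmp"
```

On this sample the two AMY1A lines differ in their counts, so the whole-line comparison finds no duplicates while the first-field comparison finds one, and the last command keeps one full line per gene. Presumably the 216 in the question is the number of fully identical lines and 271 the number of repeated gene names, but that is a guess without seeing the file.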