So this question is bugging me, and I have a million other projects to get to, so I was hoping to clear it up. So far I haven't been able to find an answer, and it seems like it should be pretty simple. I used:
awk '$1' merged_counts.txt | sort | uniq -d | wc
and got 216 lines. However, that number is incorrect. If I use
more merged_counts.txt | cut -f 1 | sort | uniq -d | wc
I get 271 lines, which is correct. If I use
awk '{print $1}' merged_counts.txt | sort | uniq -d | wc
I also get 271 lines, but then I've also lost the rest of the fields. I cannot figure out why awk behaves this way for what seems like an elementary task. Thanks for any help or suggestions. Surely I must be overlooking something.
Example of file:
B3GALT1 72 128 65 124 87 118 102 117 38 106 87 115 27 20 89 30
AMY1A 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
PSENEN 654 459 648 462 508 399 537 532 696 460 625 473 621 322 633 434
The gene 'AMY1A' is one of those genes annotated on both DNA strands, so it appears twice in my file.
I see in a comment you say you need to keep the entire line but filter for duplicates based only on the first field, so let's start with that. Let's further assume that your fields are separated by any whitespace and that you always want to print the first line when a duplicate first field occurs.
The awk command you'd use then would be:
awk '!seen[$1]++' file
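To unpack that: seen[$1]++ evaluates to 0 (false) the first time a given first field appears, so !seen[$1]++ is true and awk prints the whole line; on every later occurrence the test is false and the line is suppressed. For example, on a trimmed version of your sample where AMY1A appears twice (the second AMY1A line is invented here for illustration, since your excerpt shows it only once):

AMY1A 0 0 1 0
AMY1A 2 5 0 1
B3GALT1 72 128 65 124
PSENEN 654 459 648 462

the command would print:

AMY1A 0 0 1 0
B3GALT1 72 128 65 124
PSENEN 654 459 648 462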
Now, update your question with a description, sample input, and expected output to tell us what else you need.
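As for why your original command counted 216 rather than 271: awk '$1' prints the entire line whenever the first field is non-empty and non-zero, so the downstream uniq -d was comparing whole lines rather than just gene names, and two AMY1A lines with different count columns are not duplicates to uniq. If what you actually want is the uniq -d behavior keyed on the first field while keeping full lines, a minimal sketch would be:

awk 'seen[$1]++ == 1' merged_counts.txt

The test is true only on the second occurrence of a gene, so exactly one full line prints per duplicated gene; piping that to wc -l should match your 271. If you instead want every line belonging to a duplicated gene, one common approach is to read the file twice:

awk 'NR == FNR { cnt[$1]++; next } cnt[$1] > 1' merged_counts.txt merged_counts.txt

The first pass counts how often each gene name occurs; the second pass prints only the lines whose gene name occurred more than once.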