Search code examples
unixawkterminalgnu-coreutils

Print duplicate count without removing duplicates in Terminal


I am new to working with the Terminal on mac and have a large .tsv file that consists of a list of items, and two values next to it. I would like to be able to print the number of duplicates next to the first occurrence of the item Without removing additional data.

I am aware of cut -f 1 | sort | uniq -c but this removes a lot of valuable data I would like to keep for analysis. I'm reading about awk and grep but I think I could use a little help.

This is an example of the file I'm trying to process:

fruit   number  reference
apple   12  342
apple   13  345
apple   43  4772
banana  19  234
banana  73  3242
peach   131 53423
peach   234 3266
peach   242 324
peach   131 56758
peaches 29  2434

Ideally, the output would look something like this:

fruit   number  reference   fruit_count
apple   12  342 3
apple   13  345 
apple   43  4772    
banana  19  234 2
banana  73  3242    
peach   131 53423   4
peach   234 3266    
peach   242 324 
peach   131 56758   
peaches 29  2434    1

Is something like this even possible? I can get the desired output excel using formulas, but the file is too large and keeps crashing on me. Any help would be appreciated.

EDIT: Adding My current solution (that does not meet my requirements)

cut -f 1 fruitsample.txt | sort | uniq -c | sed -e 's/ *//' -e 's/ / /'

This gives me the intended counts, replacing the standard count + space output from uniq -c with a tab character, but it also sorts the header row and removes the 2nd and third columns.

On Excel, I can use the formula =IF(COUNTIF(A$2:A2,A2)=1,COUNTIF(A:A,A2),"") and fill it down. The file I'm working with is nearly 680K rows of data, and Excel chokes trying to calculate that many rows.

As I mentioned, I am a beginner looking for guidance. I'm just not as familiar with awk or grep. Thanks again!


Solution

  • awk to the rescue!

    awk 'NR==FNR {a[$1]++; next} 
         FNR==1  {print $0, "fruit_count"; next} 
         $1 in a {$(NF+1)=a[$1]; delete a[$1]}1' file{,} | 
    column -t
    
    fruit    number  reference  fruit_count
    apple    12      342        3
    apple    13      345
    apple    43      4772
    banana   19      234        2
    banana   73      3242
    peach    131     53423      4
    peach    234     3266
    peach    242     324
    peach    131     56758
    peaches  29      2434       1
    

    for explanation of the main idea I'll use a simpler structure without header, and unsorted data

    $ cat file
    apple
    banana
    apple
    apple
    cherry
    banana
    
    $ awk 'NR==FNR {a[$1]++; next}            # in the first pass, save key counts
                    $1 in a                   # if the key in map
                            {$(NF+1)=a[$1];   # add the count as a last column
                             delete a[$1]}    # remove key from map
                    1                         # print
           ' file{,} |                        # bash shorthand for: file file
      column -t                               # pretty print columns 
    
    
    apple   3
    banana  2
    apple
    apple
    cherry  1
    banana
    

    for the simplified example, using unix toolchain you can achieve the same with

    join -a1 -11 -22 -o1.2,2.1 <(cat -n file) <(cat -n file | sort -k2 | uniq -c -f1)
    

    adding header will require more juggling; it's where awk shines.