
Adding file name to the counted data


Suppose I have files similar to the following.

file 1

1,144931087,144931087,T,C  
16,89017167,89017167,C,G  
17,7330235,7330235,G,T  
17,10222478,10222478,C,T  

file 2

1,144931087,144931087,T,C
16,89017167,89017167,C,G
17,10222478,10222478,C,T

file 3

17,10222478,10222478,C,T  

I would like to count how many times each line occurs across all of the files. Ideally, the output would look like this:

Output

2 1,144931087,144931087,T,C  
2 16,89017167,89017167,C,G  
3 17,10222478,10222478,C,T  
1 17,7330235,7330235,G,T 

I used the following command to count the duplicate values.

sort Test1.csv Test2.csv Test3.csv | uniq --count

Now I would like to add the names of the files each line came from to the counted output. My desired output should look like this:

Test1 Test2 2 1,144931087,144931087,T,C  
Test1 Test2 2 16,89017167,89017167,C,G  
Test1 Test2 Test3 3 17,10222478,10222478,C,T  
Test1 1 17,7330235,7330235,G,T  

Can anyone help me get the desired output, or suggest a better way to get it?


Solution

  • Using awk. The files foo, bar and baz stand in for your three CSV files; sorry about my clever file naming scheme:

    $ awk '{
        a[$0]++                   # count hits
        b[$0]=b[$0] FILENAME " "  # store filenames
    }
    END {
        for(i in a)               # iteration order is unspecified
            print b[i] a[i],i     # output them
    }' foo bar baz
    foo bar 2 1,144931087,144931087,T,C
    foo bar 2 16,89017167,89017167,C,G
    foo bar baz 3 17,10222478,10222478,C,T
    foo 1 17,7330235,7330235,G,T
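
One caveat with the script above: if the same line occurs more than once within a single file, that file name is appended once per occurrence. The following is only a sketch of one way around that, not part of the original answer. It assumes the same stand-in file names foo, bar and baz, recreates the question's sample data in a scratch directory, and repeats one line in baz on purpose. The seen array and the trailing-blank cleanup are my own additions.

```shell
# Recreate the sample files in a scratch directory (baz repeats its line on purpose).
dir=$(mktemp -d)
printf '%s\n' 1,144931087,144931087,T,C 16,89017167,89017167,C,G \
              17,7330235,7330235,G,T 17,10222478,10222478,C,T > "$dir/foo"
printf '%s\n' 1,144931087,144931087,T,C 16,89017167,89017167,C,G \
              17,10222478,10222478,C,T > "$dir/bar"
printf '%s\n' 17,10222478,10222478,C,T 17,10222478,10222478,C,T > "$dir/baz"

out=$(cd "$dir" && awk '{
    sub(/[ \t\r]+$/, "")             # drop trailing blanks so identical records match
    a[$0]++                          # count every hit
    if (!seen[$0, FILENAME]++)       # first time this line shows up in this file
        b[$0] = b[$0] FILENAME " "   # so record the file name only once
}
END {
    for (i in a)
        print b[i] a[i], i
}' foo bar baz | LC_ALL=C sort)      # sort, since the for-in order is unspecified

printf '%s\n' "$out"
```

Here the count for the repeated line still includes both occurrences in baz, but baz is listed only once. Piping through sort is optional and only there to make the output order repeatable.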
    

    UPDATED per comments, keying on the first four fields only and printing comma-separated output:

    $ awk 'BEGIN {
        FS=OFS=","
    }
    {
        k=$1 OFS $2 OFS $3 OFS $4   # key: the first four fields
        a[k]++                      # count hits
        b[k]=b[k] FILENAME "|"      # store filenames
        c[k]=$0                     # keep the last record seen for this key
    }
    END {
        for(i in a)
            print b[i] "," a[i],c[i]
    }' foo bar baz
    foo|bar|,2,16,89017167,89017167,C
    foo|,1,17,7330235,7330235,G
    foo|bar|,2,1,144931087,144931087,T
    foo|bar|baz|,3,17,10222478,10222478,C
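
The file list in the updated version ends with a trailing "|". If that is unwanted, a possible variant (a sketch only, run here against the question's original five-field sample data under the assumed names foo, bar and baz) inserts the separator only between names:

```shell
# Recreate the question's sample files in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' 1,144931087,144931087,T,C 16,89017167,89017167,C,G \
              17,7330235,7330235,G,T 17,10222478,10222478,C,T > "$dir/foo"
printf '%s\n' 1,144931087,144931087,T,C 16,89017167,89017167,C,G \
              17,10222478,10222478,C,T > "$dir/bar"
printf '%s\n' 17,10222478,10222478,C,T > "$dir/baz"

out2=$(cd "$dir" && awk 'BEGIN { FS = OFS = "," }
{
    k = $1 OFS $2 OFS $3 OFS $4      # key: the first four fields
    if (a[k]++)                      # count hits; nonzero means the key was seen before
        b[k] = b[k] "|"              # so put a separator before the next name
    b[k] = b[k] FILENAME             # store the file name
    c[k] = $0                        # keep the last record seen for this key
}
END {
    for (i in a)
        print b[i], a[i], c[i]
}' foo bar baz | LC_ALL=C sort)      # sort only to make the order repeatable

printf '%s\n' "$out2"
```

With this data every record is printed in full, for example foo|bar|baz,3,17,10222478,10222478,C,T, because c[k] holds the whole last record rather than just the key.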