Search code examples
bashshellcsvunixcomm

Unix Compare Two CSV files using comm


I have two CSV files. 1.csv files has 718 entries and 2.csv has 68000 entries.

#cat 1.csv
#Num    #Name  
 1      BoB
 2      Jack
 3      John
 4      Hawk
 5      Scot
 ...........

#cat 2.csv
#Num #Name
1   BoB
2   John
3   Linda
4   Hawk
5   Scot
........

I knew how to compare two files,when only one column(Names) is available in both and to get the matching names.

#comm -12 <(sort 1.csv) <(sort 2.csv)

Now i would like to check, If Num in 1.csv is matching with Num in 2.csv, What is the associated "Names" from both the csv files for that matched Num ?

Result : 

1,Bob,Bob
2,Jack,John
3,John,Linda
4,Hawk,Hawk
5,Scot,Scot
..........

How to do achieve this using comm ?


Solution

  • You can use the join command to perform inner join on 2 csv files on the 1st field i.e the number. Here is an example:

    $ cat f1.csv 
    1      BoB
    2      Jack
    3      John
    4      Hawk
    5      Scot
    6      ExtraInF1
    $ cat f2.csv 
    1   BoB
    3   Linda
    4   Hawk
    2   John
    5   Scot
    7   ExtraInF2
    $ join <(sort -t ' ' -k 1 f1.csv) <(sort -t ' ' -k 1 f2.csv)
    1 BoB BoB
    2 Jack John
    3 John Linda
    4 Hawk Hawk
    5 Scot Scot
    $ join <(sort -t ' ' -k 1 f1.csv) <(sort -t ' ' -k 1 f2.csv) | tr -s ' ' ,
    1,BoB,BoB
    2,Jack,John
    3,John,Linda
    4,Hawk,Hawk
    5,Scot,Scot
    $
    

    Note I have added few dummy rows(number 6 and 7) and also note that they haven't appeared in the output as they aren't present in both files.

    <(sort -t ' ' -k 1 f1.csv) means process substitution i.e substitute the output of the process at this place. sort with delimiter as space(-t ' ') and on 1st key i.e 1st column(-k 1) and join by default performs inner join on 1st column of both files.