Search code examples
bashshellcsvawksh

find unique entries in one field across two CSV files with shell and friends


I have two CSV files I want to compare. The first looks like:

aaaaaaaa-fb34-4e3e-aed5-eec78d02d59b,1234
bbbbbbbb-f76a-4a05-bb53-63aba9d03fe2,5678
cccccccc-e351-4d8e-b44a-080f6ccdef7d,9012

The second looks like (only one field):

bbbbbbbb-f76a-4a05-bb53-63aba9d03fe2
cccccccc-e351-4d8e-b44a-080f6ccdef7d

I want to compare (with shell script and commands like comm, awk or others) these two sets based on the first field (the UUID in the example), showing full rows in the first file with UUIDs that do not appear in the second file. So, for the sample data above, I want to write myscript so that

$ myscript file1.csv file2.csv
aaaaaaaa-fb34-4e3e-aed5-eec78d02d59b,1234

If I just had two lists of UUIDs I could easily compare with comm but I'm not sure how to deal with the extra field. I guess I could cut off the second field, comm, and then loop over results and grep in the original two-field set, but this seems a little inefficent. Any better ways with standard utilities? I see also this question but that seems to discard rows / fields that are not common.


Solution

  • If awk is an option, you can use it to first read all lines of the second file as keys into an array, and then filter the first file by containedness of its first column among those array keys. Note the order of the files.

    awk -F, 'NR==FNR {a[$0]; next} !($1 in a)' file2.csv file1.csv
    
    aaaaaaaa-fb34-4e3e-aed5-eec78d02d59b,1234