Search code examples
bashuniquecomm

BASH comm command, but for multiple columns


I am looking for something similar to the bash command comm, that I can use to select entries both unique to my 2 files and common to them. Comm worked great when I had just one column per file, eg.

 comm -13 FILE1.txt FILE2.txt > Entries_only_in_file1.txt

But now I have multiple columns of info I wish to keep. I want to select column 2 as the one to filter rows for unique and common entries between my two files. If the entry in column two appears in both files I also want to record the info in columns 3,4,and 5 (if possible, this is not as important). Here is an example of input and output.

FILE1.txt
NM_023928   AACS    2   2   1
NM_182662   AADAT   2   2   1
NM_153698   AAED1   1   5   3
NM_001271   AAGAB   2   2   1


FILE2.txt
NM_153698   AAED1   2   5   3
NM_001271   AAGAB   2   2   1
NM_001605   AARS    3   40  37
NM_212533   ABCA2   3   4   2

Output wanted:

COMMON.txt
NM_153698   AAED1   1   5   3   2   5   3
NM_001271   AAGAB   2   2   1   2   2   1

UNIQUE_TO_1.txt
NM_023928   AACS    2   2   1
NM_182662   AADAT   2   2   1

UNIQUE_TO_2.txt
NM_001605   AARS    3   40  37
NM_212533   ABCA2   3   4   2

I know there has been similar questions before but I can't quite find what I'm looking for. Any ideas greatly appreciated, thank you.


Solution

  • join has the following options which are useful for your task:

    • -j FIELD: join on field FIELD
    • -o FORMAT: specify output format, as a comma separated list of FILENUM.FIELD.
    • -v FILENUM: output lines only on FILENUM.

    Common to both files:

    $ join -j2 -o 1.1,1.2,1.3,1.4,1.5,2.3,2.4,2.5 FILE1.txt FILE2.txt 
    NM_153698 AAED1 1 5 3 2 5 3
    NM_001271 AAGAB 2 2 1 2 2 1
    

    Unique to FILE1:

    $ join -j2 -v1 FILE1.txt FILE2.txt 
    AACS NM_023928 2 2 1
    AADAT NM_182662 2 2 1
    

    Unique to FILE2:

    $ join -j2 -v2 FILE1.txt FILE2.txt 
    AARS NM_001605 3 40 37
    ABCA2 NM_212533 3 4 2