
Remove Duplicates from Multiple files with Awk or similar


I have multiple 2-column, tab-separated files of different lengths, and I want to remove from each file the lines whose Column 2 value appears in ALL of the files.

For example:

File 1:

9   1975
1518    a
5   a.m.
16  able
299 about
8   above
5   access

File 2:

6   a
6   abandoned
140 abby
37  able
388 about
17  above
6   accident

File 3:

5   10
8   99
23  1992
7   2002
29  237th
11  60s
8   77th
2175    a
5   a.m.
6   abandoned
32  able
370 about

File 4:

5   911
1699    a
19  able
311 about
21  above
6   abuse

The desired result is to have the items in Column 2 that are common to ALL files removed from each respective file, as follows:

File 1:

9   1975
5   a.m.
8   above
5   access

File 2:

6   abandoned
140 abby
17  above
6   accident

File 3:

5   10
8   99
23  1992
7   2002
29  237th
11  60s
8   77th
5   a.m.
6   abandoned

File 4:

5   911
21  above
6   abuse

Some of the standard methods for finding duplicate values do not work for this task because I am looking for values that are duplicated across multiple files, not within a single file. Tools like comm or sort/uniq therefore don't apply here. Is there an awk approach, or some other tool, that I can use to achieve my desired result?


Solution

  • Something like this (untested) will work if you can't have duplicated $2s within a file:

    awk '
    FNR==1 {
        if (seen[FILENAME]++) {
            # Second time through this file: write the filtered output.
            firstPass = 0
            outfile = FILENAME "_new"
        }
        else {
            # First time through this file: just count the $2 values.
            firstPass = 1
            numFiles++
            ARGV[ARGC++] = FILENAME   # queue this file to be read a second time
        }
    }
    firstPass { count[$2]++; next }
    count[$2] != numFiles { print > outfile }   # keep lines whose $2 is not in every file
    ' file1 file2 file3 file4
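
    For example, with the sample files above (using the file1 through file4 names from the command), the originals are left untouched and the filtered copies are written to file1_new through file4_new. In the sample data the $2 values present in all four files are "a", "able" and "about", so if the script behaves as intended, file1_new should contain:

    9   1975
    5   a.m.
    8   above
    5   access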
    

    If you can have duplicated $2s within a file, it's a small tweak to only increment count[$2] the first time $2 appears in each file, e.g.

    firstPass { if (!seen[FILENAME,$2]++) count[$2]++; next }
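
    Putting the two pieces together, the full command with that tweak applied would be (still untested, same assumptions as above):

    awk '
    FNR==1 {
        if (seen[FILENAME]++) {
            firstPass = 0
            outfile = FILENAME "_new"
        }
        else {
            firstPass = 1
            numFiles++
            ARGV[ARGC++] = FILENAME
        }
    }
    firstPass { if (!seen[FILENAME,$2]++) count[$2]++; next }
    count[$2] != numFiles { print > outfile }
    ' file1 file2 file3 file4

    Note that seen[FILENAME] and seen[FILENAME,$2] use different array keys (the latter is FILENAME SUBSEP $2), so reusing the seen array here does not cause a clash.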