I have multiple 2-column, tab-separated files of different lengths, and I want to remove from each file the lines whose Column 2 values are common to ALL of the files.
For example:
File 1:
9 1975
1518 a
5 a.m.
16 able
299 about
8 above
5 access
File 2:
6 a
6 abandoned
140 abby
37 able
388 about
17 above
6 accident
File 3:
5 10
8 99
23 1992
7 2002
29 237th
11 60s
8 77th
2175 a
5 a.m.
6 abandoned
32 able
370 about
File 4:
5 911
1699 a
19 able
311 about
21 above
6 abuse
The desired result is for the Column 2 items that are common to ALL files to be removed from each respective file, giving the following:
File 1:
9 1975
5 a.m.
16 able
8 above
5 access
File 2:
6 abandoned
140 abby
37 able
17 above
6 accident
File 3:
5 10
8 99
23 1992
7 2002
29 237th
11 60s
8 77th
5 a.m.
6 abandoned
32 able
File 4:
5 911
19 able
21 above
6 abuse
Some of the standard methods for finding duplicate values do not work for this task because I am trying to find values that are duplicated across multiple files. Thus something like comm
or sort/uniq
on its own is not enough here.
Is there some kind of awk
approach, or another tool that works across files, that I can use to achieve my desired result?
Something like this (untested) will work if you can't have duplicated $2s within a file:
awk '
FNR==1 {
    if (seen[FILENAME]++) {
        # second time we reach line 1 of this file: start of the output pass
        firstPass = 0
        outfile = FILENAME "_new"
    }
    else {
        # first time we see this file: count it and queue it to be read again
        firstPass = 1
        numFiles++
        ARGV[ARGC++] = FILENAME
    }
}
firstPass { count[$2]++; next }            # pass 1: count how many files each $2 appears in (assumes no repeats within a file)
count[$2] != numFiles { print > outfile }  # pass 2: keep lines whose $2 is not in every file
' file1 file2 file3 file4
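If the *_new files look right, a small shell loop along these lines (untested, and assuming the same file1 ... file4 names as above) could move them back over the originals:

for f in file1 file2 file3 file4; do
    mv -- "${f}_new" "$f"      # replace each original with its filtered version
done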
If you can have duplicated $2s within a file, it's a small tweak to only increment count[$2] the first time $2 appears in each file, e.g.
firstPass { if (!seen[FILENAME,$2]++) count[$2]++; next }
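As a separate, untested sanity check, a one-off command along these lines should list the $2 values that are common to every file (i.e. the ones the script above removes), assuming the only command-line arguments are the data files:

awk '!seen[FILENAME,$2]++ { count[$2]++ } END { for (v in count) if (count[v] == ARGC-1) print v }' file1 file2 file3 file4

Here ARGC-1 is the number of file arguments, so a $2 value is printed only if it was seen in every file, counting it at most once per file.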