Search code examples
bashawkgrepmultiple-columns

Using awk and inverse grep for specific columns across files


Assume a tab-separated file as input file 1 ($IN1):

$ cat input1.tsv
HEAD1   HEAD2   HEAD3   HEAD4   HEAD5   HEAD6
Qux ZX_999876   Bar Foo MN111111    Quux
Foo AB_123456   Bar Baz CD789123    Qux
Bar AC_456321   Baz Qux GF333444    Foo
Foo CD789123    Qux Baz GH987124    Qux

Further assume a csv-separated file as input file 2 ($IN2):

$ cat input2.csv
AB_123456,CD789123
ZX_999876,MN111111

Input file 2 specifies which lines of input file 1 need to be deleted. Specifically, if column 1 entry of input file 2 is equal to the column 2 entry of input file 1, then all lines of input file 1 in which column 2 entry of input file 2 is equal to the column 2 entry of input file 1 are to be removed.

Desired output:

$ cat input1_filtered.tsv
HEAD1   HEAD2   HEAD3   HEAD4   HEAD5   HEAD6
Qux ZX_999876   Bar Foo MN111111    Quux
Foo AB_123456   Bar Baz CD789123    Qux
Bar AC_456321   Baz Qux GF333444    Foo

I wish to achieve the desired output via Bash. My (so far insufficient) attempt:

while IFS=, read -r CL1 CL2
do
    if awk -F"\t" '$2=="$CL1"' $IN1
    then
        grep -vPw "(?<=\t)$CL2" $IN1 > tmp.txt
        mv tmp.txt $IN1
    fi
done < $IN2

Solution

  • This uses GNU awk specifically:

    • process input2.csv, storing the mapping
    • process input1.tsv, determining if col1 of input2.csv is present
    • process input1.tsv a second time, filtering out the unwanted records
    gawk '
        BEGINFILE   { FS = ARGIND == 1 ? "," : "\t" }
        ARGIND == 1 { is_present[$1] = $2; next }
        ARGIND == 2 { if ($2 in is_present) del[is_present[$2]] = 1; next }
        !($2 in del)
    ' input2.csv input1.tsv input1.tsv
    

    outputs

    HEAD1   HEAD2   HEAD3   HEAD4   HEAD5   HEAD6
    Qux ZX_999876   Bar Foo MN111111    Quux
    Foo AB_123456   Bar Baz CD789123    Qux
    Bar AC_456321   Baz Qux GF333444    Foo
    

    ref: ARGIND, BEGINFILE