
Compare files with awk


I have two similar files (both with 3 columns). I'd like to check whether these two files contain the same elements (possibly listed in a different order). First of all, I'd like to compare only the first column.

file1.txt

"aba" 0 0
"abc" 0 1
"abd" 1 1
"xxx" 0 0

file2.txt

"xyz" 0 0
"aba" 0 0
"xxx" 0 0
"abc" 1 1

How can I do this using awk? I looked around but found only complicated examples. What if I also want to include the other two columns in the comparison? The output should give me the number of matching elements.


Solution

  • To print the elements common to both files (compared on the first column):

    $ awk 'NR==FNR{a[$1];next}$1 in a{print $1}' file1.txt file2.txt
    "aba"
    "xxx"
    "abc"
    

    Explanation:

    NR and FNR are awk variables that hold the total number of records read so far and the number of records read from the current file, respectively (by default, a record is a line).

    NR==FNR {       # Only true while reading the first file (assuming it is not empty)
        a[$1]       # Store column one as a key of the associative array a
        next        # Skip the remaining rules and move on to the next line
    }
    $1 in a {       # True if column one of the second file's line is a key in a
        print $1    # If so, print it
    }
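To see why NR==FNR singles out the first file, it can help to print both counters for every line. The sketch below is only an illustration: it recreates the two sample files from the question in a scratch directory (the `tmpdir` and `counters` variable names are assumptions made for this example) and prints the file name, NR and FNR for each record.

```shell
# Recreate the question's sample files in a temporary directory (illustrative setup).
tmpdir=$(mktemp -d)
printf '%s\n' '"aba" 0 0' '"abc" 0 1' '"abd" 1 1' '"xxx" 0 0' > "$tmpdir/file1.txt"
printf '%s\n' '"xyz" 0 0' '"aba" 0 0' '"xxx" 0 0' '"abc" 1 1' > "$tmpdir/file2.txt"

# Print the current file name, the global record number (NR)
# and the per-file record number (FNR) for every line.
counters=$(cd "$tmpdir" && awk '{print FILENAME, NR, FNR}' file1.txt file2.txt)
printf '%s\n' "$counters"
```

With the sample data, NR keeps counting from 1 to 8 across both files, while FNR restarts at 1 when awk opens file2.txt, so the two are equal only for the four lines of file1.txt.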
    

    To match whole lines instead, use $0:

    $ awk 'NR==FNR{a[$0];next}$0 in a{print $0}' file1.txt file2.txt
    "aba" 0 0
    "xxx" 0 0
    

    Or match on a specific set of columns:

    $ awk 'NR==FNR{a[$1,$2,$3];next}($1,$2,$3) in a{print $1,$2,$3}' file1.txt file2.txt
    "aba" 0 0
    "xxx" 0 0