linux bash compare unique diff

In linux I want to compare yesterday's file to today's file getting only changes from today as the output, ignoring some fields


I have 2 pipe-delimited files: yesterday.txt and today.txt.

yesterday.txt:

1234|12|Bill|Blatt|programmer
3243|34|Bill|Blatt|dentist
98734|25|Jack|Blatt|programmer
748567|31|Mark|Spark|magician

today.txt:

123|12|Bill|Blatt|programmer
3243|4|Bill|Blatt|dentist
934|25|Jack|Blatt|prograbber
30495|89|Dave|Scratt|slobber

I would like to compare the 2 files while ignoring the first 2 fields and output any lines unique to the second file (today.txt), but I want the full lines even though the comparison is omitting the first 2 fields. So in the case above the output would be:

new_today.txt

934|25|Jack|Blatt|prograbber
30495|89|Dave|Scratt|slobber

I tried to accomplish this using:

sort <(cut -d"|" -f3- yesterday.txt) <(cut -d"|" -f3- yesterday.txt) <(cut -d"|" -f3- today.txt) | uniq -u

This almost works, but it doesn't give me the 2 fields that I cut. I'm not sure how to accomplish this. Any help would be much appreciated.


Solution

  • When the first file is small enough for its lines to fit in memory, Awk can do this efficiently in a single pass over each file, without sorting:

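    awk -F'|' -v OFS='|' '
      # First file (yesterday.txt): record each line, minus its first two fields.
      NR == FNR {
        $1 = ""
        $2 = ""
        seen[$0]++
      }
      # Second file (today.txt): print the full line if its key is new.
      NR != FNR {
        orig = $0
        $1 = ""
        $2 = ""
        if (!seen[$0]) print orig
      }' yesterday.txt today.txt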
    

    As a one-liner: awk -F'|' -v OFS='|' 'NR == FNR { $1 = ""; $2 = ""; seen[$0]++ } NR != FNR { orig=$0; $1 = ""; $2 = ""; if (!seen[$0]) print orig }' yesterday.txt today.txt

    To save the result to a file, append > new_today.txt to either command.

    For the example input files this outputs:

    934|25|Jack|Blatt|prograbber
    30495|89|Dave|Scratt|slobber
    

    Here's how it works:

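    • We pass two input files to the Awk script on the command line, yesterday.txt first and today.txt second. The order matters, because the script treats the first file as the reference.
    • -F'|' -v OFS='|' -- use pipe as both the input and the output field separator. Awk rebuilds $0 with the output separator whenever a field is assigned.
    • Filter 1: NR == FNR -- this matches lines in the first input file. NR counts records across all files while FNR restarts at 1 for each file, so the two are equal only while the first file is being read.
      • We build a map of lines we've seen, without the first two fields. We do this by clearing the values of the first two fields ($1, $2), using the rebuilt line ($0) as the key, and counting it.
    • Filter 2: NR != FNR -- this matches lines not in the first input file.
      • We save the original line, compute the key the same way, and if that key was not seen in the first file, we print the original line.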
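
The two mechanisms the explanation relies on can be checked in isolation. The following is a minimal sketch, assuming a POSIX shell with awk and mktemp available; the scratch files a.txt and b.txt are hypothetical names used only for this demonstration, not part of the original answer.

```shell
# Hypothetical scratch files, created only for this demonstration.
tmpdir=$(mktemp -d)
printf '%s\n' '1234|12|Bill|Blatt|programmer' '3243|34|Bill|Blatt|dentist' > "$tmpdir/a.txt"
printf '%s\n' '123|12|Bill|Blatt|programmer' > "$tmpdir/b.txt"

# NR counts records across all input files; FNR restarts at 1 for each file.
# They are equal only while the first file is being read.
counters=$(awk '{ print NR, FNR }' "$tmpdir/a.txt" "$tmpdir/b.txt")
printf '%s\n' "$counters"
# prints:
# 1 1
# 2 2
# 3 1

# Assigning to a field makes awk rebuild $0 using OFS, so blanking the
# first two fields leaves a key that starts with two empty fields.
key=$(awk -F'|' -v OFS='|' 'NR == 1 { $1 = ""; $2 = ""; print }' "$tmpdir/a.txt")
printf '%s\n' "$key"
# prints: ||Bill|Blatt|programmer

rm -rf "$tmpdir"
```

Setting OFS to the same character as the input separator keeps the key unambiguous. With the default OFS (a single space), two different lines could in principle collapse to the same key if the fields themselves contain spaces.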

    Notice that this approach also preserves the original order of the lines in the second file.
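
To try the approach end to end, here is a sketch that recreates the sample files from the question in a temporary directory and wraps the same technique in a shell function. The function name new_lines and its skip parameter (the number of leading fields to ignore) are hypothetical, introduced here for illustration only; a POSIX shell with awk and mktemp is assumed.

```shell
# new_lines SKIP OLD NEW
# Hypothetical helper: print the lines of NEW whose content, ignoring the
# first SKIP pipe-delimited fields, does not occur in OLD.
new_lines() {
  awk -F'|' -v OFS='|' -v skip="$1" '
    {
      orig = $0                              # keep the untouched line
      for (i = 1; i <= skip; i++) $i = ""    # blank the ignored fields
    }
    NR == FNR { seen[$0] = 1; next }         # first file: remember the key
    !($0 in seen) { print orig }             # second file: print unseen lines
  ' "$2" "$3"
}

# Recreate the sample data from the question in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' \
  '1234|12|Bill|Blatt|programmer' \
  '3243|34|Bill|Blatt|dentist' \
  '98734|25|Jack|Blatt|programmer' \
  '748567|31|Mark|Spark|magician' > "$tmpdir/yesterday.txt"
printf '%s\n' \
  '123|12|Bill|Blatt|programmer' \
  '3243|4|Bill|Blatt|dentist' \
  '934|25|Jack|Blatt|prograbber' \
  '30495|89|Dave|Scratt|slobber' > "$tmpdir/today.txt"

new_lines 2 "$tmpdir/yesterday.txt" "$tmpdir/today.txt" > "$tmpdir/new_today.txt"
result=$(cat "$tmpdir/new_today.txt")
printf '%s\n' "$result"
# prints:
# 934|25|Jack|Blatt|prograbber
# 30495|89|Dave|Scratt|slobber

rm -rf "$tmpdir"
```

The two printed lines appear in the same order as in today.txt, since the second file is read sequentially and never sorted. One caveat of the NR == FNR test: if the first file is empty, the condition also holds for the second file, so this sketch would print nothing in that case.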