Tags: linux, file, awk, sed, cut

How do I check if there are duplicate values across files at a specific position?


I have about 2000 files in a directory on a Linux server. In each file, positions x-y hold an invoice number. What is the best way to check for duplicates across these files and print the file names and values? Here is a simplified version of the problem:

$ cat a.txt 
xyz1234
xyz1234
pqr4567
$ cat b.txt 
lon9876
lon9876
lon4567

In the above 2 files, assuming that the invoice numbers are in positions 4-7, we have a duplicate across files: "4567" appears in both a.txt and b.txt. Duplicates within the same file, like "1234" in a.txt, are fine; no need to print those. I tried to cut out the invoice numbers, but the output doesn't include the file names. My plan was to cut the invoice numbers together with the file names, run the result through uniq, and so on, roughly as sketched below.
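
A rough sketch of that plan in shell (assuming the invoice number occupies characters 4-7, as in the sample data, and that the files match *.txt):

    $ for f in *.txt; do
          cut -c4-7 "$f" | sort -u | awk -v f="$f" '{ print f "\t" $0 }'
      done | awk -F'\t' '
          { count[$2]++; files[$2] = files[$2] ? files[$2] "\t" $1 : $1 }
          END { for (inv in count) if (count[inv] > 1) print inv "\t" files[inv] }
      '
    4567	a.txt	b.txt

Here sort -u removes duplicates within each file first, so the final count per invoice number equals the number of distinct files it appears in.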


Solution

  • Perl to the rescue!

    perl -lne '
        $in_file{ substr $_, 3, 4 }{$ARGV} = 1;
        END {
            for $invoice (keys %in_file) {
                print join "\t", $invoice, keys %{ $in_file{$invoice} }
                    if keys %{ $in_file{$invoice} } > 1;
            }
        }
    ' -- *.txt
    
    • -n reads the input files line by line, running the code for each;
    • -l removes newlines from the input and adds them to printed lines;
    • $ARGV contains the name of the currently open file;
    • we build a hash of hashes, the first level key is the invoice number, the second level key is the file it was found in;
    • see substr for the details on how to extract the invoice number;
    • at the end of all input, we print every invoice number that has more than one file associated with it, along with those file names.
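
With the two sample files above, the output is

    4567	a.txt	b.txt

(the order of invoice numbers and file names may vary, since they come out of hashes).

Since the question is also tagged awk, the same idea translates to a rough awk sketch (not the answer's original code): one array keyed by invoice number and file name so that each file is counted at most once per invoice, plus a per-invoice list of file names.

    awk '
        {
            inv = substr($0, 4, 4)               # characters 4-7, 1-based
            if (!((inv, FILENAME) in seen)) {    # record each file once per invoice
                seen[inv, FILENAME] = 1
                files[inv] = (inv in files) ? files[inv] "\t" FILENAME : FILENAME
                count[inv]++
            }
        }
        END {
            for (inv in count)
                if (count[inv] > 1)
                    print inv "\t" files[inv]
        }
    ' *.txt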