Tags: perl, sorting, awk, sed, sdiff

How to compare values between two files?


I have two files, each with two columns separated by a space:

cat file1.txt
281475225437349 33,32,21,17,20,22,18,30,19,16,23,31
281475550885480 35,32,33,21,39,40,57,36,41,17,20,38,34,37,16,99

cat file2.txt
281475550885480 16,17,20,21,32,33,34,35,36,37,38,39,40,41
281475225437349 16,17,18,19,20,21,22,23,24,25,30,31,32,33

I want to compare the values in column 2 of file1 with those in column 2 of file2 for the same value in column 1, and print only the values that exist in file1's column 2 but not in file2's column 2 (not vice versa), along with the corresponding value from column 1.

Desired output

Nothing should be printed for 281475225437349, because every value in file1's column 2 is also present in file2's column 2 for that key.

For 281475550885480, only the values that are present in file1's column 2 but absent from file2's column 2 should be printed, i.e. 57 and 99.

So the output file should look like this:

cat output.txt
281475550885480 57,99

I have tried sorting the files and then comparing them with sdiff, but it only shows the differences between lines, and it takes a long time:

sdiff file1.txt file2.txt

Solution

  • Perl solution: build a hash of hashes from the second file. The outer key is the large number in column 1; the inner keys are the smaller numbers from the comma-separated list in column 2. Then iterate over the first file and report the numbers that are not in the remembered structure.

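    #!/usr/bin/perl
    use warnings;
    use strict;
    use feature qw{ say };

    open my $f1, '<', 'file1.txt' or die $!;
    open my $f2, '<', 'file2.txt' or die $!;

    # Remember every value listed for each key in the second file.
    my %seen;
    while (<$f2>) {
        my ($key, $value_string) = split ' ';
        my @values = split /,/, $value_string;
        undef @{ $seen{$key} }{@values};  # hash slice: one inner key per value
    }

    # For each key in the first file, keep the values the second file lacks.
    while (<$f1>) {
        my ($key, $value_string) = split ' ';
        my @values = split /,/, $value_string;
        my %surplus;
        undef @surplus{@values};
        delete @surplus{ keys %{ $seen{$key} } };
        # Hash key order is not defined, so sort numerically for stable output.
        say $key, ' ', join ',', sort { $a <=> $b } keys %surplus
            if keys %surplus;
    }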
    

    BTW, when you switch the files, the output will be

    281475225437349 24,25
    

    because 24 and 25 aren't present in file1.
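
  • awk alternative (a sketch, not part of the original answer): the same idea can probably be expressed with awk, which the question is tagged with. The shell function name surplus is hypothetical and chosen only for illustration. It reads the second file first and remembers each key/value pair, then walks the first file and prints the values that were not remembered, in the order they appear in the first file. It assumes the second file is not empty, because the NR == FNR test cannot otherwise tell the two files apart.

```shell
#!/bin/sh
# surplus FILE_A FILE_B (hypothetical helper name):
# for each key in FILE_A, print the values that FILE_A lists for that key
# but FILE_B does not. Keys with no surplus values print nothing.
surplus() {
    awk '
        # First file given to awk (FILE_B): remember every key/value pair.
        NR == FNR {
            n = split($2, vals, ",")
            for (i = 1; i <= n; i++)
                seen[$1, vals[i]] = 1
            next
        }
        # Second file given to awk (FILE_A): collect values not remembered.
        {
            n = split($2, vals, ",")
            out = ""
            for (i = 1; i <= n; i++)
                if (!(($1, vals[i]) in seen))
                    out = out (out == "" ? "" : ",") vals[i]
            if (out != "")
                print $1, out
        }
    ' "$2" "$1"
}
```

    With the sample files from the question, surplus file1.txt file2.txt should print 281475550885480 57,99, and swapping the arguments should print 281475225437349 24,25. Unlike the Perl version, nothing is stored in a per-line hash for the first file, so the values come out in input order without sorting.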