Search code examples
csvperlcpanmetacpan

Compare 2 CSV Huge CSV Files and print the differences to another csv file using perl


I have 2 csv files of multiple fields(approx 30 fields), and huge size ( approx 4GB ).

File1:

EmployeeName,Age,Salary,Address
Vinoth,12,2548.245,"140,North Street,India"
Vivek,40,2548.245,"140,North Street,India"
Karthick,10,10.245,"140,North Street,India"

File2:

EmployeeName,Age,Salary,Address
Vinoth,12,2548.245,"140,North Street,USA"
Karthick,10,10.245,"140,North Street,India"
Vivek,40,2548.245,"140,North Street,India"

I want to compare these 2 files and report the differences into another csv file. In the above example, Employee Vivek and Karthick details are present in different row numbers but still the record data is same, so it should be considered as match. Employee Vinoth record should be considered as a mismatch since there is a mismatch in the address.

Output diff.csv file can contain the mismatched record from the File1 and File 2 as below.

Diff.csv
EmployeeName,Age,Salary,Address
F1, Vinoth,12,2548.245,"140,North Street,India" 
F2, Vinoth,12,2548.245,"140,North Street,USA"

I've written the code so far as below. After this I'm confused which option to choose whether a Binary Search or any other efficient way to do this. Could you please help me?

My approach
1. Load the File2 in memory as hashes of hashes.
2.Read line by line from File1 and match it with the hash of hashes in memory.

use strict;
use warnings;
use Text::CSV_XS;
use Getopt::Long;
use Data::Dumper;
use Text::CSV::Hashify;
use List::BinarySearch qw( :all );

# Get Command Line Parameters

my %opts = ();
GetOptions( \%opts, "file1=s", "file2=s", )
  or die("Error in command line arguments\n");

if ( !defined $opts{'file1'} ) {
    die "CSV file --file1 not specified.\n";
}
if ( !defined $opts{'file2'} ) {
    die "CSV file --file2 not specified.\n";
}

my $file1 = $opts{'file1'};
my $file2 = $opts{'file2'};
my $file3 = 'diff.csv';

print $file2 . "\n";

my $csv1 =
  Text::CSV_XS->new(
    { binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } );
my $csv2 =
  Text::CSV_XS->new(
    { binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } );
my $csvout =
  Text::CSV_XS->new(
    { binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } );

open( my $fh1, '<:encoding(utf8)', $file1 )
  or die "Cannot not open '$file1' $!.\n";
open( my $fh2, '<:encoding(utf8)', $file2 )
  or die "Cannot not open '$file2' $!.\n";
open( my $fh3, '>:encoding(utf8)', $file3 )
  or die "Cannot not open '$file3' $!.\n";
binmode( STDOUT, ":utf8" );

my $f1line   = undef;
my $f2line   = undef;
my $header1  = undef;
my $f1empty  = 'false';
my $f2empty  = 'false';
my $reccount = 0;
my $hash_ref = hashify( "$file2", 'EmployeeName' );
if ( $f1empty eq 'false' ) {
    $f1line = $csv1->getline($fh1);
}
while (1) {

    if ( $f1empty eq 'false' ) {
        $f1line = $csv1->getline($fh1);
    }
    if ( !defined $f1line ) {
        $f1empty = 'true';
    }

    if ( $f1empty eq 'true' ) {
        last;
    }
    else {
        ## Read each line from File1 and match it with the File 2 which is loaded as hashes of hashes in perl. Need help here.

        }
    }

print "End of Program" . "\n";

Solution

  • Storing data of such magnitude in database is most correct approach to tasks of this kind. At minimum SQLlite is recommended but other databases MariaDB, MySQL, PostgreSQL will work quite well.

    Following code demonstrates how desired output can be achieved without special modules, but it does not take in account possibly messed up input data. This script will report data records as different even if difference can be just one extra space.

    Default output is into console window unless you specify option output.

    NOTE: Whole file #1 is read into memory, please be patient processing big files can take a while.

    use strict;
    use warnings;
    use feature 'say';
    
    use Getopt::Long qw(GetOptions);
    use Pod::Usage;
    
    my %opt;
    my @args = (
                'file1|f1=s',
                'file2|f2=s',
                'output|o=s',
                'debug|d',
                'help|?',
                'man|m'
            );
    
    GetOptions( \%opt, @args ) or pod2usage(2);
    
    print Dumper(\%opt) if $opt{debug};
    
    pod2usage(1) if $opt{help};
    pod2usage(-exitval => 0, -verbose => 2) if $opt{man};
    
    pod2usage(1) unless $opt{file1};
    pod2usage(1) unless $opt{file2};
    
    unlink $opt{output} if defined $opt{output} and -f $opt{output};
    
    compare($opt{file1},$opt{file2});
    
    sub compare {
        my $fname1 = shift;
        my $fname2 = shift;
    
        my $hfile1 = file2hash($fname1);
    
        open my $fh, '<:encoding(utf8)', $fname2
            or die "Couldn't open $fname2";
    
        while(<$fh>) {
            chomp;
            next unless /^(.*?),(.*)$/;
            my($key,$data) = ($1, $2);
            if( !defined $hfile1->{$key} ) {
                my $msg = "$fname1 $key is missing";
                say_msg($msg);
            } elsif( $data ne $hfile1->{$key} ) {
                my $msg = "$fname1 $key,$hfile1->{$key}\n$fname2 $_";
                say_msg($msg);
            }
        }
    }
    
    sub say_msg {
        my $msg = shift;
    
        if( $opt{output} ) {
            open my $fh, '>>:encoding(utf8)', $opt{output}
                or die "Couldn't to open $opt{output}";
    
            say $fh $msg;
    
            close $fh;
        } else {
            say $msg;
        }
    }
    
    sub file2hash {
        my $fname = shift;
        my %hash;
    
        open my $fh, '<:encoding(utf8)', $fname
            or die "Couldn't open $fname";
    
        while(<$fh>) {
            chomp;
            next unless /^(.*?),(.*)$/;
            $hash{$1} = $2;
    
        }
    
        close $fh;
    
        return \%hash;
    }
    
    __END__
    
    =head1 NAME
    
    comp_cvs - compares two CVS files and stores differense 
    
    =head1 SYNOPSIS
    
     comp_cvs.pl -f1 file1.cvs -f2 file2.cvs -o diff.txt
    
     Options:
        -f1,--file1 input CVS filename #1
        -f2,--file2 input CVS filename #2
        -o,--output output filename
        -d,--debug  output debug information
        -?,--help   brief help message
        -m,--man    full documentation
    
    =head1 OPTIONS
    
    =over 4
    
    =item B<-f1,--file1>
    
    Input CVS filename #1
    
    =item B<-f2,--file2>
    
    Input CVS filename #2
    
    =item B<-o,--output>
    
    Output filename
    
    =item B<-d,--debug>
    
    Print debug information.
    
    =item B<-?,--help>
    
    Print a brief help message and exits.
    
    =item B<--man>
    
    Prints the manual page and exits.
    
    =back
    
    =head1 DESCRIPTION
    
    B<This program> accepts B<input> and processes to B<output> with purpose of achiving some goal.
    
    =head1 EXIT STATUS
    
    The section describes B<EXIT STATUS> codes of the program
    
    =head1 ENVIRONMENT
    
    The section describes B<ENVIRONMENT VARIABLES> utilized in the program
    
    =head1 FILES
    
    The section describes B<FILES> which used for program's configuration
    
    =head1 EXAMPLES
    
    The section demonstrates some B<EXAMPLES> of the code
    
    =head1 REPORTING BUGS
    
    The section provides information how to report bugs
    
    =head1 AUTHOR
    
    The section describing author and his contanct information
    
    =head1 ACKNOWLEDGMENT
    
    The section to give credits people in some way related to the code
    
    =head1 SEE ALSO
    
    The section describing related information - reference to other programs, blogs, website, ...
    
    =head1 HISTORY
    
    The section gives historical information related to the code of the program
    
    =head1 COPYRIGHT
    
    Copyright information related to the code
    
    =cut
    

    Output for test files

    file1.cvs Vinoth,12,2548.245,"140,North Street,India"
    file2.cvs Vinoth,12,2548.245,"140,North Street,USA"