
Duplicate records in a CSV: how to ignore duplicates with identical values and warn only for differing values in Perl


The following code checks for duplicates in a CSV file where the TO column is "USD". I need your help to compare the resulting duplicate values: if the duplicates carry the same value, as in the case below, Perl should not give any warning. The Perl file name is Source; just change the directory and run it.

#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

my %data;   # key => first row seen for that key
my %dupes;  # key => [ rows ] for keys seen more than once
my @rows;   # every row, in input order

my $csv = Text::CSV->new()
    or die "Cannot use CSV: " . Text::CSV->error_diag();

open my $fh, "<", 'D:\Longview\ENCDEVD740\DataServers\ENCDEVD740\lvaf\inbound\data\enc_meroll_fxrate_soa_load.csv'
    or die "Cannot open CSV: $!";

while ( my $row = $csv->getline($fh) ) {
    # insert row into the row list
    push @rows, $row;

    # join the key fields with $;, the Perl 'multidimensional
    # array emulation' subscript character
    my $key = join( $;, @{$row}[ 0, 1 ] );
    # for a single key field:   my $key = $row->[$keyfieldindex];
    # for full-line duplicates: my $key = join( $;, @$row );

    # if %data already has an entry for the key, record it in %dupes
    if ( exists $data{$key} ) {    # duplicate
        # on the first duplicate, add the original row as well
        if ( not exists $dupes{$key} ) {
            push @{ $dupes{$key} }, $data{$key};
        }
        # add the duplicate row
        push @{ $dupes{$key} }, $row;
    }
    else {
        $data{$key} = $row;
    }
}

$csv->eof or $csv->error_diag();
close $fh;

# print out duplicates:

warn "Duplicate Values:\n";
warn "-----------------\n";

foreach my $key ( keys %dupes ) {
    my @keys = split( $;, $key );

    # skip out-of-scope currencies and the duplicated header line
    next if $keys[1] ne 'USD' or $keys[0] eq 'FROMCURRENCY';

    print "Key: @keys\n";
    foreach my $dupe ( @{ $dupes{$key} } ) {
        print "\tData: @$dupe\n";
    }
}
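To get the behaviour asked for above, one option is to post-process the %dupes structure so a key is reported only when at least two of its rows disagree on the rate. A minimal sketch of that filtering step, assuming RATE is the third column (index 2) and using the core List::Util module (uniq is available there since Perl 5.26); the %dupes data here is hand-built for illustration:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use List::Util qw(uniq);    # core module; uniq since Perl 5.26

# Hand-built stand-in for the %dupes structure filled in the loop above:
# each key maps to the list of rows that shared that FROM/TO pair.
my %dupes = (
    'AED,USD' => [
        [ 'AED', 'USD', '0.272257011' ],
        [ 'AED', 'USD', '0.272257011' ],    # identical rate: no warning
    ],
    'ANG,USD' => [
        [ 'ANG', 'USD', '0.557584544' ],
        [ 'ANG', 'USD', '1.557584544' ],    # differing rate: warn
    ],
);

for my $key ( sort keys %dupes ) {
    # collect the distinct rates among the duplicate rows
    my @rates = uniq map { $_->[2] } @{ $dupes{$key} };

    # all duplicates share one rate: not a real conflict, stay silent
    next if @rates < 2;

    warn "Conflicting rates for $key: @rates\n";
}
```

With this filter only ANG,USD is reported; the AED,USD pair, whose duplicates agree, is silently skipped.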

Sample data:

FROMCURRENCY,TOCURRENCY,RATE
AED,USD,0.272257011
ANG,USD,0.557584544
ARS,USD,0.01421147
AUD,USD,0.68635
AED,USD,0.272257011
ANG,USD,0.557584544
ARS,USD,0.01421147      

(Screenshot: different values for duplicates)


Solution

  • Like @Håkon wrote, it seems all your duplicates in fact have the same rate, so they should not be considered duplicates. However, it could be an idea to store the rate in a hash keyed by the from and to currencies. That way you don't need to check for duplicates on every iteration and can rely on the uniqueness of hash keys.

    It's great that you use a proper CSV parser, but here's an example using a single hash to keep track of duplicates by simply splitting on "," since the data seems reliable.

    #!/usr/bin/env perl
    
    use warnings;
    use strict;
    
    my $result = {};
    my $format = "%-4s | %-4s | %s\n";
    
    while ( my $line = <DATA> ) {
        chomp $line;
    
        my ( $from, $to, $rate ) = split( /,/x, $line );
    
        $result->{$from}->{$to}->{$rate} = 1;
    }
    
    printf( $format, "FROM", "TO", "RATES" );
    printf( "%s\n", "-" x 40 );
    
    foreach my $from ( keys %$result ) {
        foreach my $to ( keys %{ $result->{$from} } ) {
            my @rates = keys %{ $result->{$from}->{$to} };
            next if @rates < 2;
    
            printf( $format, $from, $to, join( ", ", @rates ) );
        }
    }
    
    __DATA__
    AED,USD,0.272257011
    ANG,USD,0.557584545
    ANG,USD,1.557584545
    ARS,USD,0.01421147
    ARS,USD,0.01421147
    ARS,USD,0.01421147
    AUD,USD,0.68635
    AUD,USD,1.68635
    AUD,USD,2.68635
    

    I changed the test data to contain duplicates with the same rate and with different rates, and the result prints:

    FROM | TO   | RATES
    ----------------------------------------
    ANG  | USD  | 1.557584545, 0.557584545
    AUD  | USD  | 1.68635, 0.68635, 2.68635
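    One caveat with keying on the raw rate string: "0.68635" and "0.686350" would count as different rates even though they are numerically equal. If the feed's formatting can vary, a hedged sketch that normalizes each rate with sprintf before using it as a hash key (the choice of 9 decimal places is an assumption about the feed's precision):

    ```perl
    #!/usr/bin/env perl
    use strict;
    use warnings;

    my %result;

    while ( my $line = <DATA> ) {
        chomp $line;
        my ( $from, $to, $rate ) = split /,/, $line;

        # Assumption: 9 decimal places suffice for this feed. Numerically
        # equal rates now collapse to a single hash key regardless of
        # how many trailing zeros the source printed.
        my $norm = sprintf( "%.9f", $rate );
        $result{$from}{$to}{$norm} = 1;
    }

    for my $from ( sort keys %result ) {
        for my $to ( sort keys %{ $result{$from} } ) {
            my @rates = sort keys %{ $result{$from}{$to} };
            next if @rates < 2;
            print "$from->$to has conflicting rates: @rates\n";
        }
    }

    __DATA__
    AUD,USD,0.68635
    AUD,USD,0.686350
    AUD,USD,0.70000
    ```

    Here the first two AUD rows collapse to one key, so only the genuinely conflicting pair of rates is reported.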