I have a very large file (19M rows) and the data needs to be cleaned. I am on a Windows 11 machine, and the data is being loaded into SQL Server 19. I am currently using a Perl script to remove any commas that appear between double quotes; I will post my script below. It is taking a very long time to run. I feel like PowerShell would be quicker, but I cannot seem to get it to run the regex that I need.
#!/usr/bin/perl
use strict;
use warnings;

my $SRC = $ARGV[0];
my $DES = $ARGV[1];

open(SRC, '<', $SRC) or die $!;
open(DES, '>', $DES) or die $!;

my $n_fields = 0;

while (<SRC>) {
    s/\s+$//;                            # strip trailing whitespace/newline
    if (/^\#/) {                         # header line
        my @t = split(/,/);
        $n_fields = scalar(@t);          # total number of fields
    } else {                             # actual data
        my $n_commas = tr/,//;           # total number of commas
        foreach my $i (0 .. $n_commas - $n_fields) {      # iterate ($n_commas - $n_fields + 1) times
            s/(\"[^",]+),([^"]+\")/$1\"$2/g;              # single replacement per previous answers
        }
        s/\"//g;                         # removal of double quotes (if you want)
    }
    print DES "$_\n";
}

close(DES);
close(SRC);
The code I provided above works, but is very, very slow. I am looking for a quicker way to process the data. Thank you in advance.
I take the task to be: remove commas inside quotes, and remove the quotes as well. (So clean up the extra commas and remove quotes from a CSV file.)
For starters, here is some streamlined regex-based code that does that. I'd expect it to be considerably faster than the OP's code. Please provide a sample input (and/or time this code) so it can be improved. Please note the assumptions it makes.
use warnings;
use strict;
use feature 'say';

my $file = shift // die "Usage: $0 file\n";

open my $fh, '<', $file or die "Can't open $file: $!";

my $header = <$fh>;
if ($header !~ /^#/) {    # not a header but data! process it
    print $header =~ s{"[^"]+"}{ $& =~ tr/,"//dr }ger;  #" (stops bad syntax highlighting)
}

while (<$fh>) {
    print s{"[^"]+"}{ $& =~ tr/,"//dr }ger;  #"
}
Using a library (like Text::CSV, see here for an example) is always recommended but may not be faster in this case. (Need a realistic sample to benchmark this against a CSV library.)†
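For reference, a minimal standalone sketch of that library approach might look as follows (not part of the original answer; it mirrors the by_lib routine used in the benchmark below and prints the cleaned lines directly):

use warnings;
use strict;
use Text::CSV;

my $file = shift // die "Usage: $0 file\n";

my $csv = Text::CSV->new( { binary => 1, allow_whitespace => 1 } )
    or die "Cannot use CSV: " . Text::CSV->error_diag();

open my $fh, '<', $file or die "Can't open $file: $!";

while (my $row = $csv->getline($fh)) {
    tr/,//d for @$row;                 # drop commas that were inside quoted fields
    print join(',', @$row), "\n";      # quotes are gone, since fields are re-joined bare
}
close $fh;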
† While waiting for a realistic sample, here is a benchmark.
I make an input file by repeating the following two lines 1,000 times and shuffling the result:
have,"text, and, commas","and more, commas",end
so,"no stuff with commas","yes remove, commas",done
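One way to build such a test file (a sketch only; the file name input.csv and the use of List::Util's shuffle are my assumptions, not from the original answer):

use warnings;
use strict;
use List::Util qw(shuffle);

my @samples = (
    'have,"text, and, commas","and more, commas",end',
    'so,"no stuff with commas","yes remove, commas",done',
);

# repeat the two sample lines 1,000 times each, shuffle, and write the test file
open my $out, '>', 'input.csv' or die "Can't write input.csv: $!";
print $out "$_\n" for shuffle( (@samples) x 1000 );
close $out;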
Then I run the following program on it:
use warnings;
use strict;
use feature 'say';
#use Test::More qw(no_plan); # uncomment to run a test below
use Text::CSV;
use Benchmark qw(cmpthese);
my $file = shift // die "Usage: $0 file\n";
my $runfor = shift // 3;
sub by_lib {
    my ($csv, $fh) = @_;
    my @lines;

    my @headers = @{ $csv->getline($fh) };
    if ($headers[0] !~ /^#/) {    # not headers! data -- clean and keep it
        tr/,//d for @headers;
        push @lines, join ',', @headers;
    }

    while (my $row = $csv->getline($fh)) {
        tr/,//d for @$row;
        push @lines, join ',', @$row;
    }

    seek $fh, 0, 0;    # return filehandle to beginning for the next run
    return \@lines;
}
sub by_regex {
    my ($fh) = @_;
    my @lines;

    my $header = <$fh>;
    if ($header !~ /^#/) {    # not actually a header but data!
        push @lines, $header =~ s{"([^"]+)"}{ $1 =~ tr/,//dr }ger;  #"
    }

    while (<$fh>) {
        push @lines, s{"([^"]+)"}{ $1 =~ tr/,//dr }ger;  #"
    }

    seek $fh, 0, 0;    # return filehandle to beginning for the next run
    chomp @lines;
    return \@lines;
}
my $csv = Text::CSV->new( { binary => 1, allow_whitespace => 1 } )
or die "Cannot use CSV: " . Text::CSV->error_diag ();
open my $fh, '<', $file or die "Can't open $file: $!";

# Check, at least once for every change in code
#is_deeply( by_regex($fh), by_lib($csv, $fh), 'Cleaned up CSV' );

cmpthese( -$runfor, {
    by_regex => sub { by_regex($fh) },
    by_lib   => sub { by_lib($csv, $fh) },
});
If your input plays loose with quotes, having fields like some "then quoted" more
instead of the whole field being quoted, then use the allow_loose_quotes attribute with the library.
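For example, the constructor used in the benchmark would then become something like this (a sketch):

my $csv = Text::CSV->new( {
    binary             => 1,
    allow_whitespace   => 1,
    allow_loose_quotes => 1,   # tolerate quotes appearing inside otherwise unquoted fields
} ) or die "Cannot use CSV: " . Text::CSV->error_diag();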
Running program.pl input.csv 30
(so each test runs for 30 seconds) on an old laptop with Perl v5.16, I get
            Rate   by_lib by_regex
by_lib     130/s       --     -14%
by_regex   152/s      17%       --
(Practically the same on a server with 5.36.0, with rates roughly twice as large.)
If this holds in your case, and the regex cannot be substantially improved, then I'd strongly recommend using the code with the CSV library instead. The library's correctness and its ability to handle a range of unexpected/bad CSV properties are worth far more than a 20-ish percent speed improvement.
Then, instead of collecting lines in an array (push @lines, join ',', @$row;), which is done above only for benchmarking, print them as they come (say join ',', @$row;), at least for very large files.
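For instance, a streaming variant of by_lib might look like the sketch below (an illustration only, assuming the same Text::CSV object and the feature 'say' import from the benchmark program):

sub by_lib_stream {
    my ($csv, $fh) = @_;
    while (my $row = $csv->getline($fh)) {
        tr/,//d for @$row;      # strip commas from each field
        say join ',', @$row;    # print right away instead of collecting in memory
    }
    return;
}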