Tags: regex, perl, aix, readdir, file-processing

Perl Program to efficiently process 500,000 small files in a directory


I am processing a large directory every night. It accumulates around 1 million files each night, half of which are .txt files that I need to move to a different directory according to their contents.

Each .txt file is pipe-delimited and contains only 20 records. Record 6 is the one that contains the information I need to determine which directory to move the file to.

Example Record:

A|CHNL_ID|4

In this case the file would be moved to /out/4.
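
As a minimal sketch of that mapping, the snippet below turns one record into its target directory. The sub name `target_dir_for` and the `/out` base directory passed to it are hypothetical, chosen only for illustration. It assumes the key is spelled exactly `CHNL_ID` and the channel is numeric, as in the example record.

```perl
use strict;
use warnings;

# Hypothetical helper: map one pipe-delimited record to its target directory.
# Returns undef (in scalar context) when the record is not a numeric CHNL_ID record.
sub target_dir_for {
    my ($record, $out_base) = @_;
    chomp $record;
    my ($type, $key, $value) = split /\|/, $record;
    return unless defined $key   && $key eq 'CHNL_ID';
    return unless defined $value && $value =~ /^\d+$/;
    return "$out_base/$value";
}

print target_dir_for("A|CHNL_ID|4\n", '/out'), "\n";   # prints /out/4
```

Splitting on the pipe avoids running a regex over the whole file content, but a plain pattern match works just as well for a single record.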

The script below currently processes about 80,000 files per hour.

Are there any recommendations on how I could speed this up?

opendir(DIR, $dir) or die "$!\n";
while ( defined( my $txtFile = readdir DIR ) ) {
    next if( $txtFile !~ /\.txt$/ );
    $cnt++;

    local $/;
    open my $fh, '<', $txtFile or die $!, $/;
    my $data  = <$fh>;
    my ($channel) =  $data =~ /A\|CHNL_ID\|(\d+)/i;
    close($fh);

    move ($txtFile, "$outDir/$channel") or die $!, $/;
}
closedir(DIR);

Solution

  • Try something like:

    use File::Copy qw(move);                         #provides move()
    print localtime()."\n";                          #to find where time is spent
    opendir(DIR, $dir) or die "$!\n";
    my @txtFiles = map "$dir/$_", grep /\.txt$/, readdir DIR;
    closedir(DIR);
    
    print localtime()."\n";
    my %fileGroup;
    for my $txtFile (@txtFiles){
        # local $/ = "\n";                           #\n or other record separator
        open my $fh, '<', $txtFile or die $!;
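        # scalar <$fh> reads one record per call; a bare <$fh> inside map is in
        # list context and would read the whole file on the first iteration
        local $_ = join("", grep defined, map { scalar <$fh> } 1..6);  #read 6 records, not whole file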
        close($fh);
        push @{ $fileGroup{$1} }, $txtFile
          if /A\|CHNL_ID\|(\d+)/i or die "No channel found in $txtFile\n";
    }
    
    for my $channel (sort keys %fileGroup){
      moveGroup( @{ $fileGroup{$channel} }, "$outDir/$channel" );
    }
    print localtime()." finito\n";
    
    sub moveGroup {
      my $dir=pop@_;
      print localtime()." <- start $dir\n";
      move($_, $dir) for @_;  #or something else if move() turns out to be the slow part
      #rename() needs the full target path (directory plus file name), not just $dir
    }
    

    This splits the job into three main parts (listing the directory, reading the channel from each file, and moving the files), so you can time each part and find where most of the time is spent.
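
    For timing finer than the one-second resolution of `localtime()`, one option is the core Time::HiRes module. The sketch below is only an illustration: the `timed` helper and the phase labels are hypothetical, and the placeholder bodies stand in for the three parts of the script above.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical helper: run a code block, report its wall-clock duration
# on STDERR, and pass the block's return values through.
sub timed {
    my ($label, $code) = @_;
    my $start  = time();
    my @result = $code->();
    printf STDERR "%-8s %.3f s\n", $label, time() - $start;
    return @result;
}

# Placeholder phases; substitute the directory scan, the per-file read,
# and the move loop from the script above.
my @files  = timed(list => sub { map { "file$_.txt" } 1 .. 3 });
my ($seen) = timed(read => sub { scalar @files });
timed(move => sub { return });
```

    If one phase dominates, that is the part worth optimizing first.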