I am processing a large directory every night. It accumulates around 1 million files each night, half of which are .txt files that I need to move to a different directory according to their contents. Each .txt file is pipe-delimited and contains only 20 records. Record 6 is the one that contains the information I need to determine which directory to move the file to.
Example Record:
A|CHNL_ID|4
In this case, the file would be moved to /out/4.
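In other words, the per-file lookup boils down to the snippet below (a minimal sketch; $record and $dest are illustrative names, and the regex is the same one used in the script that follows):

my $record = 'A|CHNL_ID|4';                        # record 6 of a sample file
my ($channel) = $record =~ /A\|CHNL_ID\|(\d+)/i;   # captures '4'
my $dest = "/out/$channel";                        # file is moved to /out/4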
This script is processing at a rate of 80,000 files per hour.
Are there any recommendations on how I could speed this up?
use File::Copy;    # provides move()

opendir(DIR, $dir) or die "$!\n";
while ( defined( my $txtFile = readdir DIR ) ) {
    next if $txtFile !~ /\.txt$/;    # skip non-.txt entries; note the escaped dot

    $cnt++;
    local $/;    # slurp mode: read the whole file in one go
    open my $fh, '<', $txtFile or die $!, $/;    # relative path: assumes the cwd is $dir
    my $data = <$fh>;
    close($fh);

    my ($channel) = $data =~ /A\|CHNL_ID\|(\d+)/i;
    move($txtFile, "$outDir/$channel") or die $!, $/;
}
closedir(DIR);
Try something like:
use File::Copy;    # provides move()

print localtime()."\n";    # timestamps to find where time is spent

opendir(DIR, $dir) or die "$!\n";
my @txtFiles = map "$dir/$_", grep /\.txt$/, readdir DIR;
closedir(DIR);

print localtime()."\n";

my %fileGroup;
for my $txtFile (@txtFiles){
    # local $/ = "\n";    # set this if the record separator is something other than "\n"
    open my $fh, '<', $txtFile or die $!;
    # read exactly 6 records, not the whole file; "scalar" forces one
    # line per read inside map's list context
    local $_ = join "", map { scalar <$fh> } 1..6;
    close($fh);
    push @{ $fileGroup{$1} }, $txtFile
        if /A\|CHNL_ID\|(\d+)/i or die "No channel found in $_";
}

for my $channel (sort keys %fileGroup){
    moveGroup( @{ $fileGroup{$channel} }, "$outDir/$channel" );
}

print localtime()." finito\n";

sub moveGroup {
    my $dir = pop @_;    # last argument is the destination directory
    print localtime()." <- start $dir\n";
    # note: move() requires $dir to already exist as a directory
    move($_, $dir) for @_;    # or something else if each move spawns a subprocess
    # rename($_, "$dir/".basename($_)) for @_;    # raw rename needs the full target path (File::Basename)
}
This splits the job into three main parts (listing the directory, grouping the files by channel, and moving each group), with a timestamp printed after each, so you can see where most of the time is spent.
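If the localtime() prints turn out to be too coarse (they only have one-second resolution), Time::HiRes from core Perl gives fractional-second timestamps. A minimal sketch, with mark() as a hypothetical helper name:

use Time::HiRes qw(time);    # makes time() return fractional seconds

my $last = time;
sub mark {    # hypothetical helper: print seconds elapsed since the last mark
    my ($phase) = @_;
    my $now = time;
    printf "%-6s %.3fs\n", $phase, $now - $last;
    $last = $now;
}

Call mark('list'), mark('group'), and mark('move') after each of the three parts instead of the bare localtime() prints.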
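It is also worth checking whether $dir and $outDir sit on the same filesystem: File::Copy's move() does a cheap rename() when they do, and falls back to copy-and-delete when they don't, which is much slower at a million files a night. A quick check by comparing the device numbers from stat() (assuming both directories exist):

# same device number means move() can use a plain rename
my $src_dev = (stat $dir)[0]    // die "stat $dir: $!";
my $dst_dev = (stat $outDir)[0] // die "stat $outDir: $!";
print $src_dev == $dst_dev
    ? "same filesystem: move() can rename\n"
    : "different filesystems: move() will copy and delete\n";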