performance perl directory delete-file large-data

Writing Efficient Perl Code to Crawl Through Large Directories

I am writing a Perl script that crawls through a directory of 300,000+ files and deletes all files except the earliest one for each CIK in a given year. My problem is that my code forces Perl to scan the directory of 300,000 files an estimated 300,001 times. So far it has been running for four days, and I was hoping you guys had some tips to make code like this more efficient in the future.

Script:

#!/usr/bin/perl
use Date::Calc qw(Delta_Days Decode_Date_EU);
# Note: must use default perl module on Killdevil (module add perl)

@base = (1993, 1, 1);
$count = 0;

@files = <*>; # Creates array of all files in directory
foreach $file (@files) {
    # Splits the individual filename into an array separated by
    # hyphens (CIK, 10, K, Year, Month, Date) indexed by 0-5
    @filearray = split(/\-/, $file);

    $cik = $filearray[0];
    $cikyear = $filearray[3];

    # Defines a new array as all files in directory with the
    # same CIK and year as our file
    @cikfiles = grep { /^$cik-10-K-$cikyear/ } <*>;

    $sizecik = @cikfiles;
    $best = 0; # Index for file with earliest date
    $bestsize = 1000000000000000000000000000; # Initial value to beat

    # Only run through the process if there are
    # multiple files with same CIK same year.
    if ($sizecik != 1) {

        for($i = 1; $i < $sizecik + 1; $i = $i + 1) {
            # Reads the filename and creates an array delimited by "-"
            @filearray1 = split(/-/, $cikfiles[$i-1]);

            $year = $filearray1[3];
            $month = $filearray1[4];

            # Deletes leading zero from months if there exists one
            $month =~ s/^0//;
            $day = $filearray1[5];
            $day =~ s/^0//; # Removes leading zero

            # Calculates number of days from base year
            $dd = Delta_Days($base[0], $base[1], $base[2], $year, $month, $day);

            if ($dd < $bestsize) {
                # If has lower number of days than current best, index
                # this file as the new leader
                $best = $i;

                # Reset the size to beat to the dd of this file
                $bestsize = $dd;
            }
        }

        for ($i = 1; $i < $sizecik + 1; $i = $i + 1) {
            # Runs through current array and deletes all
            # files that are not the best
            if($i != $best) {
                $rm = "rm " . $cikfiles[$i-1];
                system($rm);
                $count = $count + 1;
            }
        }
    }
}

# Displays total number of files removed
print "Number of files deleted: $count";

close(MYOUTFILE);

Would it be more efficient if instead of looking through the directory

@cikfiles = grep { /^$cik-10-K-$cikyear/ } <*>;

I instead searched through the original array and then deleted the entries?

@cikfiles = grep { /^$cik-10-K-$cikyear/ } <@files>;

How would I remove the elements I delete from the @files array?

Solution

There's no need to scan the directory more than once. Scan it once, grouping the file names by CIK and year as you go, then delete everything but the earliest file in each group.

If dates are formatted as YYYYMMDD, a simple string compare can be used to determine which of two dates is older.
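
To see why the zero-padding matters, here is a minimal sketch of the comparison in isolation. The `date_key` helper is a hypothetical name introduced only for this illustration; the `sprintf` format is the same one used in the code that follows.

```perl
use strict;
use warnings;

# Hypothetical helper: build a sortable YYYYMMDD key from the date parts.
sub date_key {
    my ($year, $month, $day) = @_;
    return sprintf("%04d%02d%02d", $year, $month, $day);
}

my $march    = date_key(2004, 3, 15);   # "20040315"
my $december = date_key(2004, 12, 1);   # "20041201"

# Zero-padded keys all have the same length, so string order matches
# date order.
print "$march is earlier than $december\n" if $march lt $december;

# Without padding the comparison goes wrong: December 1 ("2004121")
# sorts before March 15 ("2004315").
print "unpadded keys are misordered\n" if "2004121" lt "2004315";
```

Both `print` statements fire: the padded keys order correctly, and the unpadded ones do not.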

my $opt_dry_run = 1;  # Set to 0 to actually delete the files.

# Group the file names by CIK and year, pairing each name with a
# sortable YYYYMMDD key.
my %files_by_cik_and_year;
while (<*>) {
   my ($cik, undef, undef, $year, $month, $day) = split(/-/, $_);
   push @{ $files_by_cik_and_year{$cik}{$year} },
      [ $_, sprintf("%04d%02d%02d", $year, $month, $day) ];
}

for my $cik (keys(%files_by_cik_and_year)) {
   for my $year (keys(%{ $files_by_cik_and_year{$cik} })) {
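      # Sort this group's files by date, earliest first.
      my @files =
         map { $_->[0] }
            sort { $a->[1] cmp $b->[1] }
               @{ $files_by_cik_and_year{$cik}{$year} };

      # Keep the earliest file; everything left in @files gets deleted.
      shift(@files);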

      for (@files) {
         print("Deleting $_\n");
         if (!$opt_dry_run) {
            unlink($_)
               or warn("Couldn't delete $_: $!\n");
         }
      }
   }
}

The same approach, simplified by using a single "$cik-$year" hash key:

my $opt_dry_run = 1;  # Set to 0 to actually delete the files.

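# Group the file names under a combined "CIK-year" key, pairing each
# name with a sortable YYYYMMDD key.
my %files_by_cik_and_year;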
while (<*>) {
   my ($cik, undef, undef, $year, $month, $day) = split(/-/, $_);
   push @{ $files_by_cik_and_year{"$cik-$year"} },
      [ $_, sprintf("%04d%02d%02d", $year, $month, $day) ];
}

for (values(%files_by_cik_and_year)) {
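   # Sort this group's files by date, earliest first.
   my @files =
      map { $_->[0] }
         sort { $a->[1] cmp $b->[1] }
            @$_;

   # Keep the earliest file; everything left in @files gets deleted.
   shift(@files);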

   for (@files) {
      print("Deleting $_\n");
      if (!$opt_dry_run) {
         unlink($_)
            or warn("Couldn't delete $_: $!\n");
      }
   }
}
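
A possible variation, if you'd rather not sort at all: remember only the earliest file seen so far in each group and mark the rest for deletion as you go. This is a sketch, not a drop-in replacement. It assumes the names start with CIK-10-K-YYYY-MM-DD (anything else is skipped), the `files_to_delete` helper is a hypothetical name, and it reads the directory with `opendir`/`readdir` instead of a glob.

```perl
use strict;
use warnings;

my $opt_dry_run = 1;  # Set to 0 to actually delete the files.

# Hypothetical helper: given file names, return those that are NOT the
# earliest in their CIK/year group. Names that don't match the assumed
# CIK-10-K-YYYY-MM-DD pattern are skipped.
sub files_to_delete {
    my @names = @_;
    my %earliest;  # "$cik-$year" => [ name, YYYYMMDD key ]
    my @delete;

    for my $name (@names) {
        my ($cik, $year, $month, $day) =
            $name =~ /^([^-]+)-10-K-(\d{4})-(\d{1,2})-(\d{1,2})/
            or next;

        my $key   = sprintf("%04d%02d%02d", $year, $month, $day);
        my $group = "$cik-$year";
        my $best  = $earliest{$group};

        if (!$best) {
            # First file seen for this group.
            $earliest{$group} = [ $name, $key ];
        }
        elsif ($key lt $best->[1]) {
            # The new file is earlier; the previous holder gets deleted.
            push @delete, $best->[0];
            $earliest{$group} = [ $name, $key ];
        }
        else {
            push @delete, $name;
        }
    }

    return @delete;
}

# readdir in list context returns every entry in the directory, unsorted.
# "." and ".." don't match the pattern, so they are skipped.
opendir(my $dh, '.') or die("Can't open directory: $!\n");
my @doomed = files_to_delete(readdir($dh));
closedir($dh);

for (@doomed) {
    print("Deleting $_\n");
    if (!$opt_dry_run) {
        unlink($_)
            or warn("Couldn't delete $_: $!\n");
    }
}
```

Each file name is examined once, so the work grows linearly with the number of files. Calling `unlink` directly also avoids starting a separate `rm` process for every file, which is what `system("rm ...")` does.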