I have a batch of annual corporate filings, each named using the following format: company identifier, two digit year, and a random set of digits (e.g., 00000217-12-00010.txt). I want to compare the contents of each annual filing to the filing submitted by the same company in the prior year (e.g., 000002178-13-00010.txt compared to 000002178-12-00005.txt). As I loop through each file, how can I identify the preceding year’s filing for each document so that I can read both documents in as separate strings?
use strict ;
use warnings ;
use autodie ;
use File::Find ;
### BEGIN BY READING IN EACH FILE ONE BY ONE. ###
################## LOOP BEGIN ##################
# Process every file with a `txt` file type
my $parent = "D:/Cleaned 10Ks" ;
my ($par_dir, $sub_dir);
opendir($par_dir, $parent);
while (my $sub_folders = readdir($par_dir)) {
next if ($sub_folders =~ /^\.\.?$/); # skip . and ..
my $path = $parent . '/' . $sub_folders;
next unless (-d $path); # skip anything that isn't a directory
chdir($path) or die "Can't chdir to $path: $!";
for my $filename ( grep -f, glob('*') ) {
#### FIND THE PRIOR YEAR'S CORRESPONDING FILING AND READ BOTH IN AS STRINGS###
Parse the filename into its components, say by splitting on `-`, then reduce the year by 1 and reassemble the name. The snag is the year: if it is `00` you can't just subtract 1. A proper way is to use a module for dates, but since `00` is the only tricky case you can handle it manually.
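my ($comp_id, $year) = split '-', $filename;
# Step back one year, wrapping 00 to 99; zero-pad so that 10 becomes 09 rather than 9
my $prev_year = sprintf '%02d', ($year ne '00') ? $year - 1 : 99;
# Prefix of the previous year's filing: company identifier and two-digit year
my $prev_year_base = join '-', $comp_id, $prev_year;
my ($prev_year_file) = glob "$prev_year_base*";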
Only the first two fields are taken from `split`, since the rest differs between files. The previous year's filename is completed by globbing on these two components, which are assumed to make it unique. If other entries may have names beginning the same way, the list returned by `glob` should be filtered further. Since `glob` returns a list (here with one element), the parentheses around `$prev_year_file` are needed so that the sole filename is assigned in list context.
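To cover the last part of the question, reading both filings in as separate strings, here is a minimal sketch that builds on the approach above. The helper names `prev_year_prefix`, `slurp`, and `read_filing_pair` are made up for illustration. It assumes the current directory is the one holding the filings (as after the `chdir` in the question's loop) and that filenames contain no spaces.

```perl
use strict;
use warnings;

# Hypothetical helper: return the "<company>-<yy>" prefix of the prior year's filing,
# or nothing if the name does not have the expected shape.
sub prev_year_prefix {
    my ($filename) = @_;
    my ($comp_id, $year) = split /-/, $filename;
    return unless defined $year && $year =~ /^\d\d$/;
    # Wrap 00 to 99 and zero-pad, so that 10 becomes 09
    my $prev_year = sprintf '%02d', $year eq '00' ? 99 : $year - 1;
    return join '-', $comp_id, $prev_year;
}

# Hypothetical helper: read a whole file into one string.
sub slurp {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't open $path: $!";
    local $/;    # no record separator, so <$fh> returns the entire file
    my $text = <$fh>;
    close $fh;
    return $text;
}

# Hypothetical helper: return (current text, prior-year text),
# or an empty list if no prior-year filing is found.
sub read_filing_pair {
    my ($filename) = @_;
    my $prefix = prev_year_prefix($filename) or return;
    my @candidates = grep { -f } glob "${prefix}-*";
    return unless @candidates;
    warn "Several prior-year files for $filename; using $candidates[0]\n"
        if @candidates > 1;
    return (slurp($filename), slurp($candidates[0]));
}
```

Inside the question's loop this could be called as `my ($current, $previous) = read_filing_pair($filename);` followed by `next unless defined $previous;`, which skips any filing that has no predecessor. Taking the first match when several exist is only a placeholder policy; a real script may want to choose among the candidates differently.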