I've patched together a Perl script intended to take each word from a batch of documents, eliminate all stop words, stem the remaining words, and create a hash containing each stemmed word and its frequency of occurrence. However, after it runs for several minutes, I get an "Out of Memory!" message in the command window. Is there a more efficient way to achieve the desired result, or do I just need to find a way to access more memory?
#!/usr/bin/perl
use strict;
use warnings;
use Lingua::EN::StopWords qw(%StopWords);
use Lingua::Stem qw(stem);
use Mojo::DOM;
my $path = "U:/Perl/risk disclosures/2006-28";
chdir($path) or die "Cant chdir to $path $!";
# This program counts the total number of unique sentences in a 10-K and enumerates the frequency of each one.
my @sequence;
my %sequences;
my $fh;
# Opening each file and reading its contents.
for my $file (<*.htm>) {
my $data = do {
open my $fh, '<', $file;
local $/; # Slurp mode
<$fh>;
};
my $dom = Mojo::DOM->new($data);
my $text = $dom->all_text();
for ( split /\s+/, $text ) {
# Here eliminating stop words.
while ( !$StopWords{$_} ) {
# Here retaining only the word stem.
my $stemmed_word = stem($_);
++$sequences{"$stemmed_word"};
}
}
}
If a word is not in %StopWords, you enter an infinite loop:
while ( !$StopWords{$_} ) {
my $stemmed_word = stem($_);
++$sequences{"$stemmed_word"};
# %StopWords hasn't changed, so $_ is still not in it
}
There's actually no reason to use a loop here at all. You're already checking one word at a time with your for loop. A word is either a stop word or it isn't, so you only need to check it once.
I would do something more like the following:
my $dom = Mojo::DOM->new($data);
my @words = split ' ', $dom->all_text();
foreach my $word (@words) {
next if defined $StopWords{$word};
my $stemmed_word = stem($word)->[0]; # stem() returns a reference to an array of stems
++$sequences{$stemmed_word};
}
In addition to replacing the inner while loop with
next if defined $StopWords{$word};
I also made a few other changes:
Removed the $text variable, since it seems like you really only care about individual words, not the full block of text.
Added an explicit loop variable to the for loop. Various functions change $_ automatically, so to avoid unintended side effects, I use explicit loop variables for everything but one-liners like say for @array;
Dropped the quotes around the hash key in ++$sequences{"$stemmed_word"}; since they aren't needed.
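If you want to try the counting logic without your documents or the CPAN modules, here is a minimal sketch. It assumes a tiny hard-coded stop list (%stop_words) in place of %StopWords and a crude suffix stripper (toy_stem) in place of Lingua::Stem's stem; both names, and count_stems, are hypothetical stand-ins for illustration, not part of either module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-ins so the sketch runs without CPAN modules:
# a tiny stop list instead of %StopWords, and a crude suffix
# stripper instead of Lingua::Stem's stem().
my %stop_words = map { $_ => 1 } qw(the a of and is);

sub toy_stem {
    my ($word) = @_;
    $word = lc $word;
    $word =~ s/(?:ing|s)$//;    # strip one trailing "ing" or "s"
    return $word;
}

# Count how often each stem occurs in a block of text.
sub count_stems {
    my ($text) = @_;
    my %count;
    foreach my $word (split ' ', $text) {
        next if $stop_words{lc $word};    # checked once per word, no inner loop
        ++$count{ toy_stem($word) };
    }
    return \%count;
}

my $counts = count_stems('The risk of rising rates and the risks of falling rates');

# A one-liner is the one place where the implicit $_ is used.
printf "%s: %d\n", $_, $counts->{$_} for sort keys %$counts;
```

Each word is tested against the stop list exactly once, so the loop finishes after a single pass over the text and the hash only grows with the number of distinct stems.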