performance perl memory stemming stop-words

Most memory-efficient way to combine word stemming and the elimination of stop words in Perl?


I've patched together a Perl script intended to take each word from a batch of documents, eliminate all stop words, stem the remaining words, and create a hash containing each stemmed word and its frequency of occurrence. However, after it runs for several minutes, I get an "Out of Memory!" message in the command window. Is there a more efficient way to achieve the desired result, or do I just need to find a way to access more memory?

#!/usr/bin/perl
use strict;
use warnings;
use Lingua::EN::StopWords qw(%StopWords);
use Lingua::Stem qw(stem);
use Mojo::DOM;

my $path = "U:/Perl/risk disclosures/2006-28";
chdir($path) or die "Can't chdir to $path: $!";

# This program counts the total number of unique sentences in a 10-K and enumerates the frequency of each one.

my @sequence;
my %sequences;
my $fh;

# Opening each file and reading its contents.
for my $file (<*.htm>) {
    my $data = do {
        open my $fh, '<', $file;
        local $/;    # Slurp mode
        <$fh>;
    };
    my $dom  = Mojo::DOM->new($data);
    my $text = $dom->all_text();
    for ( split /\s+/, $text ) {
        # Here eliminating stop words.
        while ( !$StopWords{$_} ) {
            # Here retaining only the word stem.
            my $stemmed_word = stem($_);
            ++$sequences{"$stemmed_word"};
        }
    }
}

Solution

  • If a word is not in %StopWords, you enter an infinite loop:

    while ( !$StopWords{$_} ) {
        my $stemmed_word = stem($_);
        ++$sequences{"$stemmed_word"};
    
        # %StopWords hasn't changed, so $_ is still not in it
    }
    

    There's actually no reason to use a loop here at all. You're already checking one word at a time with your for loop. A word is either a stop-word or it isn't, so you only need to check it once.

    I would do something more like the following:

    my $dom  = Mojo::DOM->new($data);
    my @words = split ' ', $dom->all_text();
    
    foreach my $word (@words) {
        next if defined $StopWords{$word};
    
        my $stemmed_word = stem $word;
        ++$sequences{$stemmed_word};
    }
    

    In addition to replacing the inner while loop with

    next if defined $StopWords{$word};
    

    I also

    • removed the intermediate $text variable, since it seems like you really only care about individual words, not the full block of text
    • added an explicit loop variable to the for loop. Various functions change $_ implicitly, so to avoid unintended side effects I use explicit loop variables for everything except one-liners like say for @array;
    • removed extraneous quotation marks from ++$sequences{"$stemmed_word"};
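
To see the single-check pattern in isolation, here is a minimal sketch that runs without any CPAN modules. The %stop_words hash, the $toy_stemmer code reference, and the count_stems sub are hypothetical stand-ins I made up for illustration; they are not part of Lingua::EN::StopWords or Lingua::Stem.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-ins so the sketch is self-contained:
# %stop_words plays the role of %StopWords, and $toy_stemmer
# plays the role of a real stemmer.
my %stop_words  = map { $_ => 1 } qw(the of and a is);
my $toy_stemmer = sub {
    my ($word) = @_;
    ( my $stem = $word ) =~ s/(?:ing|s)\z//;    # crude suffix stripping, illustration only
    return $stem;
};

# Count stemmed, non-stop words. Each word is tested exactly once,
# so the loop always terminates.
sub count_stems {
    my ( $words, $stop, $stemmer ) = @_;
    my %count;
    for my $word (@$words) {
        next if $stop->{$word};            # skip stop words
        ++$count{ $stemmer->($word) };     # tally the stem
    }
    return \%count;
}

my @words = split ' ', 'the risks of investing and the risk of default';
my $freq  = count_stems( \@words, \%stop_words, $toy_stemmer );

printf "%s: %d\n", $_, $freq->{$_} for sort keys %$freq;
```

With the real modules you would pass \%StopWords and a small wrapper around stem instead of the toy versions. As far as I recall, the stem function exported by Lingua::Stem returns a reference to an array of stems rather than a plain string, so the wrapper would probably need to pick out the first element, something like sub { stem( $_[0] )->[0] }. Check the module's documentation before relying on that.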