Determining Word Frequency of Specific Terms


I'm a non-computer science student doing a history thesis that involves determining the frequency of specific terms in a number of texts and then plotting these frequencies over time to determine changes and trends. While I have figured out how to determine word frequencies for a given text file, I am dealing with a (relatively, for me) large number of files (>100), and for consistency's sake I would like to limit the words included in the frequency count to a specific set of terms (sort of like the opposite of a "stop list").

This should be kept very simple. In the end, all I need is the frequencies of the specific words for each text file I process, preferably in spreadsheet format (a tab-delimited file) so that I can then create graphs and visualizations using that data.

I use Linux day-to-day, am comfortable using the command line, and would love an open-source solution (or something I could run with WINE). That is not a requirement, however.

I see two ways to solve this problem:

  1. Find a way to strip out all the words in a text file EXCEPT for those in the pre-defined list and then do the frequency count from there, or:
  2. Find a way to do a frequency count using just the terms from the pre-defined list.

Any ideas?


Solution

  • I would go with the second idea. Here is a simple Perl program that reads a list of words from the first file given on the command line, counts how often each of those words occurs in the second file, and prints the counts in tab-separated format. The word list file should contain one word per line.

    #!/usr/bin/perl
    
    use strict;
    use warnings;
    
    # Usage: perl analyze.pl word_list_file text_file
    my ($word_list_file, $process_file) = @ARGV;
    die "Usage: $0 word_list_file text_file\n" unless defined $process_file;
    
    my %word_counts;
    
    # Open the word list file, read a line at a time, remove the newline,
    # add it to the hash of words to track, initialize the count to zero
    open(my $words_fh, '<', $word_list_file)
      or die "Failed to open list file: $!\n";
    while (<$words_fh>) {
      chomp;
      next unless length; # Skip blank lines
      # Store words in lowercase for case-insensitive match
      $word_counts{lc($_)} = 0;
    }
    close($words_fh);
    
    # Read the text file one line at a time, break the text up into words
    # based on word boundaries (\b), iterate through each word incrementing
    # the word count in the word hash if the word is in the hash
    open(my $text_fh, '<', $process_file)
      or die "Failed to open process file: $!\n";
    
    while (<$text_fh>) {
      chomp;
      while ( /-$/ ) {
        # If the line ends in a hyphen, remove the hyphen and
        # continue reading lines until we find one that doesn't
        chop;
        my $next_line = <$text_fh>;
        last unless defined $next_line;
        # Strip the newline first; otherwise the next chop would
        # remove the newline instead of the hyphen
        chomp $next_line;
        $_ .= $next_line;
      }
    
      my @words = split /\b/, lc; # Split the lower-cased version of the string
      foreach my $word (@words) {
        $word_counts{$word}++ if exists $word_counts{$word};
      }
    }
    close($text_fh);
    
    # Print each word in the hash in alphabetical order along with the
    # number of times encountered, delimited by tabs (\t)
    foreach my $word (sort keys %word_counts) {
      print "$word\t$word_counts{$word}\n";
    }
    

    If the file words.txt contains:

    linux
    frequencies
    science
    words
    

    And the file text.txt contains the text of your post, the following command:

    perl analyze.pl words.txt text.txt
    

    will print:

    frequencies     3
    linux   1
    science 1
    words   3
    

    Note that breaking on word boundaries using \b may not work the way you want in all cases. For example, if your text files contain words that are hyphenated across lines, you will need to do something a little more intelligent to match them: check whether the last character in a line is a hyphen and, if it is, remove the hyphen and read another line before splitting the line into words.
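    To see what splitting on \b actually produces, here is a minimal sketch. The helper name tokens is made up for this illustration and is not part of the program above; it only keeps the pieces of the split that contain word characters.

```perl
use strict;
use warnings;

# Hypothetical helper: split lower-cased text on word boundaries and
# keep only the pieces that contain at least one word character.
sub tokens {
    my ($text) = @_;
    return grep { /\w/ } split /\b/, lc $text;
}

# "well-known" is split at the hyphen, and the "fre-" at the end of
# the line is seen as the separate token "fre".
print join('|', tokens("A well-known fre-")), "\n";
# prints: a|well|known|fre
```

    Because a hyphen is not a word character, a term such as "well-known" in the word list would never be matched by a split on \b; only its parts are seen.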

    Edit: Updated the program above to match words case-insensitively and to handle words hyphenated across lines.

    Note that if some hyphenated words are broken across lines and others are not, this won't find them all, because it only removes hyphens at the end of a line. In this case you may want to remove all hyphens and match words after the hyphens are removed (any hyphenated term in your word list then has to be written without its hyphen as well). You can do this by adding the following line right before the split function:

    s/-//g;
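    Since the question involves more than 100 files and a spreadsheet-friendly result, here is a rough sketch of how the same counting logic could be run over several files at once to produce a single tab-separated table (one row per term, one column per file). The names count_terms, frequency_table, table.pl and counts.tsv are invented for this example; treat it as one possible arrangement under those assumptions rather than a finished tool.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count the given lower-cased terms in one file.
# Returns a hash reference mapping each term to its count.
sub count_terms {
    my ($terms, $path) = @_;
    my %counts = map { $_ => 0 } @$terms;

    open(my $fh, '<', $path) or die "Failed to open $path: $!\n";
    while (my $line = <$fh>) {
        chomp $line;
        # Rejoin words that are hyphenated across line breaks
        while ($line =~ s/-$//) {
            my $next = <$fh>;
            last unless defined $next;
            chomp $next;
            $line .= $next;
        }
        for my $word (split /\b/, lc $line) {
            $counts{$word}++ if exists $counts{$word};
        }
    }
    close($fh);
    return \%counts;
}

# Hypothetical helper: build the table as a string. The header row holds
# the file names; each following row holds one term and its counts.
sub frequency_table {
    my ($terms, @paths) = @_;
    my %seen;
    my @sorted = sort grep { !$seen{$_}++ } @$terms;
    my %by_file = map { $_ => count_terms(\@sorted, $_) } @paths;

    my $table = join("\t", 'term', @paths) . "\n";
    for my $term (@sorted) {
        $table .= join("\t", $term, map { $by_file{$_}{$term} } @paths) . "\n";
    }
    return $table;
}

# Command-line use (file names are placeholders):
#   perl table.pl words.txt text1.txt text2.txt > counts.tsv
if (@ARGV >= 2) {
    my ($list_file, @text_files) = @ARGV;
    open(my $fh, '<', $list_file) or die "Failed to open $list_file: $!\n";
    chomp(my @terms = grep { /\S/ } <$fh>);
    close($fh);
    print frequency_table([ map { lc($_) } @terms ], @text_files);
}
```

    The resulting file can be opened directly in a spreadsheet program. On the command line, a shell glob such as texts/*.txt (a placeholder path) can stand in for the individual file names.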