Tags: bash, stemming, hunspell

Wrapping hunspell to stem a large number of words efficiently?


I have written a script for stemming English words. It does a decent job, but it takes forever on big files (more than 1000 words, one per line). Are there ways to speed it up? Maybe a different approach altogether? A different programming language? A different stemmer?

file=$1
while read -r a
do
  # Count the output lines: 2 means one result line plus the trailing blank line.
  b="$(echo "$a" | hunspell -s -d en_US | wc -l)"
  if [[ "$b" -eq 2 ]]
  then
    # Count the words on that single result line.
    g="$(echo "$a" | hunspell -s -d en_US | wc -w)"
    if [[ "$g" -eq 1 ]]
    then
      echo "$a" | hunspell -s -d en_US | awk 'FNR==1 {print $1}'
    else
      echo "$a" | hunspell -s -d en_US | awk 'FNR==1 {print $2}'
    fi
  else
    # Several result lines: choose based on the word's suffix.
    if [[ "$a" == *ing ]] || [[ "$a" == *ed ]]
    then
      echo "$a" | hunspell -s -d en_US | awk 'FNR==2 {print $2}'
    else
      echo "$a" | hunspell -s -d en_US | awk 'FNR==1 {print $1}'
    fi
  fi
done < "$file"

Here's an example of what it does.

input file

cliché
womb
range
strain
fiddle
coup
earnest
touched
gave
dazzling
blindfolded
stagger
buying
insignia

output

cliché
womb
range
strain
fiddle
coup
earnest
touch
give
dazzle
blindfold
stagger
buy
insignia

How it works

If you pipe a word into hunspell -s -d en_US (as in echo word | hunspell -s -d en_US), the shape of the output depends on the word. The possible cases, and the action the script takes for each, are:

  • One line with one word (print that word)
  • One line with two words (print the second word)
  • Two lines with two words each, and the input word ends with "ing" or "ed" (print the second word on the second line)
  • Two lines with two words each, and the input word does not end with "ing" or "ed" (print the first word on the first line)
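
The four cases above can be decided from a single captured hunspell output, instead of re-running hunspell for every check. A minimal sketch, assuming the output has the "word stem" per-line shape that the rules describe. pick_stem is a hypothetical helper name, and the strings passed in the usage lines are made up for illustration, not real hunspell output:

```shell
#!/usr/bin/env bash
# pick_stem WORD OUTPUT  (hypothetical helper, not part of hunspell)
# WORD is the input word; OUTPUT is the text that one hunspell -s call
# printed for it. Prints the stem chosen by the four rules.
pick_stem() {
  local word=$1 output=$2 line first second
  local -a lines=()

  # Keep only the non-empty result lines.
  while IFS= read -r line; do
    [[ $line ]] && lines+=( "$line" )
  done <<< "$output"

  if (( ${#lines[@]} <= 1 )); then
    # One result line: second field if there is one, otherwise the first.
    # With no result line at all, fall back to the input word.
    read -r first second _ <<< "${lines[0]-$word}"
    printf '%s\n' "${second:-$first}"
  else
    case $word in
      *ing|*ed)   # second field of the second line
        read -r first second _ <<< "${lines[1]}"
        printf '%s\n' "$second" ;;
      *)          # first field of the first line
        read -r first second _ <<< "${lines[0]}"
        printf '%s\n' "$first" ;;
    esac
  fi
}

# Usage with made-up output strings:
pick_stem womb 'womb'            # prints: womb
pick_stem buying 'buying buy'    # prints: buy
```

In a real script you would capture the output once per word, for example out=$(echo "$a" | hunspell -s -d en_US), and pass "$out" to the function. That still starts one hunspell process per word, so it only reduces the overhead rather than removing it.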

Solution

  • The following emits the same output, and does so far, far faster. The one difference is that it does not change gave to give, which my hunspell appears not to have in its dictionary:

    last_word=; stems=( )
    while read -r word stem _; do
      if [[ $word ]]; then
        last_word=$word
        [[ $stem ]] && stems+=( "$stem" )
      else
        if (( ${#stems[@]} == 0 )); then
          printf '%s\n' "$last_word"        # no stems available; print input word
        elif (( ${#stems[@]} == 1 )); then
          printf '%s\n' "${stems[0]}"       # found one stem; print it.
        else
          case $last_word in
            *ing|*ed) printf '%s\n' "${stems[1]}" ;; # "ing" or "ed": print the 2nd stem
            *)        printf '%s\n' "${stems[0]}" ;; # otherwise: print the 1st stem
          esac
        fi
        stems=( )
      fi
    done < <(hunspell -s -d en_US <"$1")
    

    Note that this runs hunspell only once for the whole file, rather than two or three times per word. Your script spends nearly all of its time starting hunspell over and over; the slowness has nothing to do with bash itself.
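
To see the cost of process startup on its own, here is a rough sketch that compares one process per line with one process for the whole file. It is a stand-in demo: tr plays the role of hunspell (an assumption made so the example runs without a dictionary installed), and per_word and single_run are hypothetical names. The timings it prints will vary by machine and only illustrate the startup overhead, not hunspell's actual speed:

```shell
#!/usr/bin/env bash
# Stand-in demo: `tr` replaces hunspell here, purely for illustration.
words=$(mktemp)
trap 'rm -f "$words"' EXIT
printf 'word%d\n' {1..500} > "$words"   # 500 dummy input lines

per_word() {      # one new process per input line, like the original script
  local w
  while IFS= read -r w; do
    printf '%s\n' "$w" | tr a-z A-Z
  done < "$1"
}

single_run() {    # one process for the whole file, like the rewrite
  tr a-z A-Z < "$1"
}

# Both produce the same output; only the number of processes differs.
time per_word   "$words" > /dev/null
time single_run "$words" > /dev/null
```

The per-line loop pays the process startup cost once for every input line, and the original script pays it several times per word, which is why feeding the whole file to a single hunspell process helps so much.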