I have written a script for stemming English words. It does a decent job, but it takes forever on big files of more than 1000 words, one per line. Are there ways to speed it up? Maybe a different approach altogether, a different programming language, or a different stemmer?
file=$1
while read -r a
do
    b="$(echo "$a" | hunspell -s -d en_US | wc -l)"
    if [[ "$b" -eq 2 ]]
    then
        g="$(echo "$a" | hunspell -s -d en_US | wc -w)"
        if [[ "$g" -eq 1 ]]
        then
            echo "$a" | hunspell -s -d en_US | awk 'FNR==1 {print $1}'
        else
            echo "$a" | hunspell -s -d en_US | awk 'FNR==1 {print $2}'
        fi
    else
        if [[ "$a" == *ing ]] || [[ "$a" == *ed ]]
        then
            echo "$a" | hunspell -s -d en_US | awk 'FNR==2 {print $2}'
        else
            echo "$a" | hunspell -s -d en_US | awk 'FNR==1 {print $1}'
        fi
    fi
done < "$file"
Here's an example of what it does.
input file
cliché
womb
range
strain
fiddle
coup
earnest
touched
gave
dazzling
blindfolded
stagger
buying
insignia
output
cliché
womb
range
strain
fiddle
coup
earnest
touch
give
dazzle
blindfold
stagger
buy
insignia
If you run hunspell -s -d en_US word, it can give you different results depending on the word. Options, and actions to take, follow:
The following emits the exact same output (except for changing gave to give, which my hunspell appears not to have in its dictionary), and does so far, far faster:
last_word=; stems=( )
while read -r word stem _; do
    if [[ $word ]]; then
        last_word=$word
        [[ $stem ]] && stems+=( "$stem" )
    else
        if (( ${#stems[@]} == 0 )); then
            printf '%s\n' "$last_word"  # no stems available; print input word
        elif (( ${#stems[@]} == 1 )); then
            printf '%s\n' "${stems[0]}"  # found one stem; print it.
        else
            case $last_word in
                *ing|*ed) printf '%s\n' "${stems[1]}" ;;  # "ing" or "ed": print the 2nd stem
                *)        printf '%s\n' "${stems[0]}" ;;  # otherwise: print the 1st stem
            esac
        fi
        stems=( )
    fi
done < <(hunspell -s -d en_US <"$1")
Note that this runs hunspell only once for the whole file, not once per word. Your script starts hunspell two or three times for every input word, and that repeated startup, not anything to do with bash, is where it spends all its time.
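If you want to check the stem-selection rules without running hunspell at all, one option is to move the loop into a function that reads from stdin. The following is only a sketch of that idea: the names choose_stems and emit_stem are made up for illustration, and it assumes the same input layout the loop above relies on (lines of "word [stem]", with a blank line closing each word's group). It also prints a final group that is not followed by a blank line, which the loop above would drop.

```shell
#!/usr/bin/env bash

# emit_stem WORD [STEM...]  (hypothetical helper name)
# Prints the stem chosen for one input word, using the same rules as above.
emit_stem() {
    local word=$1
    shift
    if (( $# == 0 )); then
        printf '%s\n' "$word"              # no stems: print the word itself
    elif (( $# == 1 )); then
        printf '%s\n' "$1"                 # exactly one stem: print it
    else
        case $word in
            *ing|*ed) printf '%s\n' "$2" ;;  # "ing"/"ed": print the 2nd stem
            *)        printf '%s\n' "$1" ;;  # otherwise: print the 1st stem
        esac
    fi
}

# choose_stems  (hypothetical helper name)
# Reads "word [stem]" lines from stdin; a blank line ends each word's group.
choose_stems() {
    local word stem last_word=''
    local -a stems=()
    # "|| [[ $word ]]" keeps a last line that has no trailing newline.
    while read -r word stem _ || [[ $word ]]; do
        if [[ $word ]]; then
            last_word=$word
            if [[ $stem ]]; then stems+=( "$stem" ); fi
        elif [[ $last_word ]]; then
            emit_stem "$last_word" "${stems[@]}"
            last_word=''
            stems=()
        fi
    done
    # Flush a final group that was not closed by a blank line.
    if [[ $last_word ]]; then
        emit_stem "$last_word" "${stems[@]}"
    fi
}
```

Usage would look like hunspell -s -d en_US < words.txt | choose_stems, where words.txt stands in for your own input file. hunspell is still started only once; the function just makes the selection logic reusable and easy to feed with canned input.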