I created a script to count the frequency of words in a plain text file. The script performs the following steps:
1. Tally the frequency of every word in the corpus.
2. Keep only the tallied words that also appear in a dictionary.
3. Write each remaining word and its count as a comma-separated line.
The script is at: http://pastebin.com/VAZdeKXs
#!/bin/bash
# Create a tally of all the words in the corpus.
#
echo Creating tally of word frequencies...
sed -e 's/ /\n/g' -e 's/[^a-zA-Z\n]//g' corpus.txt | \
  tr [:upper:] [:lower:] | \
  sort | \
  uniq -c | \
  sort -rn > frequency.txt
echo Creating corpus lexicon...
rm -f corpus-lexicon.txt
for i in $(awk '{if( $2 ) print $2}' frequency.txt); do
  grep -m 1 ^$i\$ dictionary.txt >> corpus-lexicon.txt;
done
echo Creating lexicon...
rm -f lexicon.txt
for i in $(cat corpus-lexicon.txt); do
  egrep -m 1 "^[0-9 ]* $i\$" frequency.txt | \
    awk '{print $2, $1}' | \
    tr ' ' ',' >> lexicon.txt;
done
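For reference, the first stage leaves frequency.txt with uniq -c style lines: a count, then the word. With a toy corpus such as "the cat sat on the mat" it would look roughly like this (illustrative sample only, not from the real corpus):

      2 the
      1 sat
      1 on
      1 mat
      1 cat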
The following lines repeatedly cycle through the dictionary to match words:
for i in $(awk '{if( $2 ) print $2}' frequency.txt); do
  grep -m 1 ^$i\$ dictionary.txt >> corpus-lexicon.txt;
done
It works, but it is slow: to drop the words that are not in the dictionary, it scans the entire dictionary once for every word found in the corpus. (The -m 1 parameter stops a scan as soon as a match is found, but since the majority of the words are not in the dictionary, most scans still read the whole file.)
How would you optimize the script so that the dictionary is not scanned from start to finish for every single word?
Thank you!
You can use grep -f to search for all of the words in one pass over frequency.txt:
awk '{print $2}' frequency.txt | grep -Fxf dictionary.txt > corpus-lexicon.txt
- -F to search for fixed strings.
- -x to match whole lines only.
- -f to read the search patterns from dictionary.txt.

In fact, you could even combine this with the second loop and eliminate the intermediate corpus-lexicon.txt file. The two for loops can be replaced by a single grep:
grep -Fwf dictionary.txt frequency.txt | awk '{print $2 "," $1}'
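For example, the whole second half of the script (both loops) could be reduced to something like this sketch, assuming the same file names as in the question:

echo Creating lexicon...
grep -Fwf dictionary.txt frequency.txt | awk '{print $2 "," $1}' > lexicon.txt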
Notice that I changed -x to -w: each line of frequency.txt begins with the count, so a whole-line match would never succeed, but -w still matches the dictionary word as a whole word anywhere on the line.
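To see the difference, suppose frequency.txt contains the line "     12 apple" and dictionary.txt contains apple (made-up sample data):

grep -Fxf dictionary.txt frequency.txt   # no output: "apple" would have to match the whole line
grep -Fwf dictionary.txt frequency.txt   # prints "     12 apple": "apple" matches as a whole word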