I want to solve a common but very specific problem: due to OCR errors, a lot of subtitle files contain the character "I" (upper case i) instead of "l" (lower case L).
My plan of attack is:
I could certainly tokenize and reconstruct the entire file in a script, but before I go down that path: is it possible to use awk and/or sed for this kind of conditional operation at the word level?
Any other suggested approaches would also be very welcome!
You don't really need more than bash (plus hunspell for the dictionary lookup) for this:
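while IFS= read -r line; do
    # split the line on whitespace without triggering glob expansion
    read -ra words <<< "$line"
    for i in "${!words[@]}"; do
        word=${words[i]}
        # only words containing "I" can be fixed, so skip the rest
        [[ $word == *I* ]] || continue
        if [[ -n $(hunspell -l <<< "$word") ]]; then
            # hunspell had some output: the word is not in the dictionary
            tmp=${word//I/l}
            if [[ -z $(hunspell -l <<< "$tmp") ]]; then
                # no output for the new word, therefore it's a dictionary word
                words[i]=$tmp
            fi
        fi
    done
    # print the new line (runs of whitespace collapse to single spaces)
    printf '%s\n' "${words[*]}"
done < filename > filename.new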
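That said, the loop starts a separate hunspell process for each word it checks, which is slow on large files. It does seem to make more sense to pass the whole file to hunspell once and parse its output.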
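The whole-file idea could be sketched roughly as follows. This is only a sketch under a few assumptions: bash 4+ for associative arrays, and `hunspell -l` printing one misspelled word per line for input on stdin. The names `misspelled` and `fix_subtitles` are made up for illustration. It runs hunspell once on the file and once on all the candidate replacements, then rewrites the lines.

```bash
#!/usr/bin/env bash
# Sketch only: needs bash 4+ (associative arrays) and hunspell on PATH.

# Print the words from stdin that the spell checker rejects, one per line.
# (Hypothetical helper; wrapping the call makes the checker easy to swap.)
misspelled() { hunspell -l; }

# fix_subtitles FILE: print FILE with fixable "I" -> "l" OCR errors corrected.
fix_subtitles() {
    local file=$1 w c line
    local -A bad=() rejected=() fix=()

    # Pass 1: one run over the whole file; keep rejected words containing "I".
    while IFS= read -r w; do
        [[ $w == *I* ]] && bad[$w]=1
    done < <(misspelled < "$file")

    # Nothing to fix: copy the file through unchanged.
    if (( ${#bad[@]} == 0 )); then cat -- "$file"; return; fi

    # Pass 2: one more run over all candidate replacements at once.
    while IFS= read -r c; do
        [[ -n $c ]] && rejected[$c]=1
    done < <(for w in "${!bad[@]}"; do printf '%s\n' "${w//I/l}"; done | misspelled)

    # Keep only the candidates that the checker accepted.
    for w in "${!bad[@]}"; do
        c=${w//I/l}
        [[ -z ${rejected[$c]:-} ]] && fix[$w]=$c
    done

    # Pass 3: rewrite each line (plain substring replacement, a simplification).
    while IFS= read -r line || [[ -n $line ]]; do
        for w in "${!fix[@]}"; do
            line=${line//"$w"/"${fix[$w]}"}
        done
        printf '%s\n' "$line"
    done < "$file"
}
```

It would be used like the loop above: fix_subtitles filename > filename.new. Unlike the per-word loop, this keeps the original whitespace and punctuation, but because pass 3 replaces substrings rather than whole words, a short misspelling that happens to occur inside a longer word would be replaced there too.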