Tags: bash, sed, awk, hunspell, spell-checking

Using awk for conditional find/replace


I want to solve a common but very specific problem: due to OCR errors, a lot of subtitle files contain the character "I" (upper case i) instead of "l" (lower case L).

My plan of attack is:

  1. Process the file word by word
  2. Pass each word to the hunspell spellchecker ("echo the-word | hunspell -l" produces no response at all if it is valid, and a response if it is bad)
  3. If it is a bad word, AND it has uppercase Is in it, then replace these with lowercase l and try again. If it is now a valid word, replace the original word.

I could certainly tokenize and reconstruct the entire file in a script, but before I go down that path I was wondering if it is possible to use awk and/or sed for these kinds of conditional operations at the word-level?

Any other suggested approaches would also be very welcome!


Solution

  • You don't really need more than bash for this:

    while IFS= read -r line; do
      # split on whitespace without glob-expanding the words
      read -ra words <<< "$line"
      for ((i=0; i<${#words[@]}; i++)); do
        word=${words[i]}
        if [[ -n $(hunspell -l <<< "$word") ]]; then
          # hunspell had some output, so it rejected the word
          tmp=${word//I/l}
          if [[ $tmp != "$word" && -z $(hunspell -l <<< "$tmp") ]]; then
            # no output for new word, therefore it's a dictionary word
            words[i]=$tmp
          fi
        fi
      done
      # print the new line (runs of whitespace collapse to single spaces)
      printf '%s\n' "${words[*]}"
    done < filename > filename.new
    

    That said, this starts hunspell up to twice for every word, which gets slow on large files. It probably makes more sense to pass the whole file to hunspell once and parse its output.
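
As a rough sketch of that whole-file idea (not tested against real subtitle files), something like the following runs hunspell twice per file rather than twice per word. The function names `list_bad_words` and `fix_ocr_file` are made up for this example. It assumes bash 4 or later (for associative arrays), a hunspell dictionary that matches your locale, and that a "word" is simply a run of letters, which may not match hunspell's own tokenization for words containing apostrophes or digits; such words are left unchanged.

```bash
#!/usr/bin/env bash
# Sketch only: fix OCR "I" -> "l" errors with two hunspell runs per file.
# list_bad_words and fix_ocr_file are illustrative names, not part of any tool.

# Read text on stdin, print the words hunspell rejects, one per line.
list_bad_words() {
  hunspell -l
}

# fix_ocr_file FILE: print FILE with I->l applied to every word that is
# misspelled as written but valid after the swap.
fix_ocr_file() {
  local file=$1 word fixed line rest out
  local re='^([^[:alpha:]]*)([[:alpha:]]+)(.*)$'
  local -a candidates=()
  local -A still_bad=() replacement=()

  # Pass 1: collect misspelled words that contain an uppercase I.
  while IFS= read -r word; do
    [[ $word == *I* ]] && candidates+=("$word")
  done < <(list_bad_words < "$file" | sort -u)

  # Nothing to fix: print the file unchanged.
  ((${#candidates[@]})) || { cat -- "$file"; return; }

  # Pass 2: spellcheck all swapped candidates in a single run.
  while IFS= read -r word; do
    [[ -n $word ]] && still_bad[$word]=1
  done < <(printf '%s\n' "${candidates[@]//I/l}" | list_bad_words)

  # Keep only the swaps that produced a dictionary word.
  for word in "${candidates[@]}"; do
    fixed=${word//I/l}
    [[ -z ${still_bad[$fixed]} ]] && replacement[$word]=$fixed
  done

  # Rewrite the file word by word, keeping spacing and punctuation as they are.
  while IFS= read -r line || [[ -n $line ]]; do
    out='' rest=$line
    # Peel off "non-letters, then one word" until no letters remain.
    while [[ $rest =~ $re ]]; do
      word=${BASH_REMATCH[2]}
      out+=${BASH_REMATCH[1]}${replacement[$word]:-$word}
      rest=${BASH_REMATCH[3]}
    done
    printf '%s\n' "$out$rest"
  done < "$file"
}

# Usage (file names are placeholders):
#   fix_ocr_file filename > filename.new
```

Unlike the loop above, this keeps the original whitespace, because the line is rebuilt piece by piece instead of being split into an array and rejoined. Correctly spelled words such as "I" are never reported by hunspell, so they are left alone.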