
MS Word Doc: Automating find/replace using Shell Scripts


I have a number of Word documents that I'd like to remove some elements from. What I would like to do is as follows:

  1. Get the entire contents of the Word file into a text file, either by copy and paste (may not be necessary) or by converting the .doc to .txt
  2. Using regex: replace \[.*\] with "" AND replace \(.*\) with ""
  3. Save the result to a text file with the same name as the original Word document.

Thoughts and direction appreciated. I don't know how to do any of these things programmatically, so for now I'm doing it all manually.

If it matters, I'm using Ubuntu 11.04


Solution

  • Since you're open to using plain text, some improvements to your algo:

    1. Use antiword to automate conversion from doc to txt
    2. Use sed to do in-place regex modification: sed -i -e's/bad/good/' file.txt
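
    The two steps above can be chained into one small script. This is only a sketch under a few assumptions: the files are legacy .doc files that antiword can read, they sit in the current directory, and "" in the question means "delete the match". The function names strip_brackets and convert_doc are made up for illustration.

```shell
#!/bin/sh
# strip_brackets: read text on stdin and write it to stdout
# with every [...] and (...) span removed.
# [^]]* and [^)]* stop at the first closing bracket, so two
# bracketed spans on one line are removed separately.
strip_brackets() {
    sed -e 's/\[[^]]*\]//g' -e 's/([^)]*)//g'
}

# convert_doc: convert one .doc file to a cleaned .txt file
# with the same base name (report.doc -> report.txt).
convert_doc() {
    doc=$1
    txt="${doc%.doc}.txt"
    # antiword prints the document's plain text on stdout
    antiword "$doc" | strip_brackets > "$txt"
}

# Process every .doc file in the current directory.
for doc in ./*.doc; do
    [ -e "$doc" ] || continue   # skip if the glob matched nothing
    convert_doc "$doc"
done
```

    Piping antiword straight into sed avoids an intermediate file, so sed -i is not needed here. The original .doc files are left untouched.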

    Update (in response to comment):

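    The regexes are nearly fine, but I didn't completely understand the objective. One adjustment for sed: in its default (basic) syntax a literal parenthesis is written unescaped, and \( \) marks a group. I've also used [^]]* and [^)]* instead of .* so that a match stops at the first closing bracket.

    • if you want to replace occurrences of [foo] & (foo) with "" use (for outright deletion, leave the replacement empty, e.g. s/\[[^]]*\]//g):

      sed -i -e 's/\[[^]]*\]/""/g' file.txt; sed -i -e 's/([^)]*)/""/g' file.txt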

    • if you want to replace occurrences of [foo] & (foo) with "foo" each use:

      sed -i -e 's/\[\([^]]*\)\]/"\1"/g' file.txt; sed -i -e 's/(\([^)]*\))/"\1"/g' file.txt
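
    One thing worth checking on real data before running sed -i: the \[.*\] pattern from the question is greedy, so on a line with more than one bracketed span it matches from the first [ to the last ]. The sketch below compares it with a bounded pattern on a made-up sample line; the function names are hypothetical.

```shell
#!/bin/sh
# strip_greedy: the pattern as written in the question.
strip_greedy() {
    sed -e 's/\[.*\]//g'
}

# strip_bounded: [^]]* cannot cross a closing bracket,
# so each [...] span is matched on its own.
strip_bounded() {
    sed -e 's/\[[^]]*\]//g'
}

line='keep [one] this (two) and [three] too'

printf '%s\n' "$line" | strip_greedy
# prints: keep  too

printf '%s\n' "$line" | strip_bounded
# prints: keep  this (two) and  too
```

    Neither form handles nested brackets such as [a [b] c]; if those occur in the documents, the result should be checked by hand.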