
Command line to convert all .docx files in a directory (and its subdirectories) to text files


I would like to convert all .docx files in a directory (and its subdirectories) to text files from the command line, so that I can run grep on them afterwards. I found this

unzip -p tutu.docx word/document.xml | sed -e 's/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g; s/[^[:print:]\n]\{1,\}//g'

here, which works well, but it prints the text to the terminal. I would like to write the text to a new file (.txt, for instance) in the same directory as the .docx file, and I would like a script that does this recursively.

I have this, using antiword, which does what I want for .doc files, but it doesn't work for .docx files.

find . -name '*.doc' | while read i; do antiword -i 1 "${i}" >"${i/doc/txt}"; done

I tried to combine the two, but without success. A command line that handles both formats at the same time would be appreciated!

Thank you


Solution

  • The following script:

    • converts all docx files under the directory where you run it, recursively (change the . in find . to your desired starting directory)
    • writes each txt file next to the docx file it was converted from

    Bash script:

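    find . -name "*.docx" | while IFS= read -r file; do
        # Quote "$file" so paths containing spaces are not split into several arguments.
        unzip -p "$file" word/document.xml |
            # ${file%.docx}.txt swaps only the trailing extension, not "docx" elsewhere in the path.
            sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g' > "${file%.docx}.txt"
    done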
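
A note on naming the output file: a pattern substitution such as ${file/docx/txt} (or ${i/doc/txt} in the question) replaces the first match anywhere in the path, whereas ${file%.docx}.txt strips only the trailing extension. The small sketch below shows the difference; the path is a made-up example, chosen so that "docx" also appears in a directory name.

```shell
#!/usr/bin/env bash
# Hypothetical path: "docx" appears in the directory name AND in the extension.
file="./docx-files/report.docx"

# Pattern substitution replaces the FIRST match anywhere in the string,
# so here it rewrites the directory name and leaves the extension alone.
substituted="${file/docx/txt}"

# Suffix removal strips only a trailing ".docx"; ".txt" is then appended.
suffixed="${file%.docx}.txt"

printf '%s\n' "$substituted"   # ./txt-files/report.docx
printf '%s\n' "$suffixed"      # ./docx-files/report.txt
```

With the first form, the redirection would point into a directory that probably does not exist, so the suffix form is the safer choice whenever a directory name might contain the extension string.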
    

    Afterwards, you can run grep like this:

    grep -r "some text" --include "*.txt" .
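
The question also asks for one command that handles .doc and .docx together. One possible way to do that is sketched below, assuming unzip, antiword, and GNU sed are installed. The function name convert_tree is hypothetical, introduced only for this illustration; the unzip/sed pipeline and the antiword call are taken from the question.

```shell
#!/usr/bin/env bash
# Sketch only: convert_tree is a hypothetical helper name, not part of the original answer.
# Assumes unzip, antiword, and GNU sed (for \n in the sed expressions) are available.

convert_tree() {
    local root="${1:-.}" file
    # -print0 with read -d '' keeps file names containing spaces or newlines intact.
    find "$root" -type f \( -name '*.docx' -o -name '*.doc' \) -print0 |
        while IFS= read -r -d '' file; do
            case "$file" in
                *.docx)
                    # unzip + sed pipeline from the question; keeps one paragraph per line.
                    unzip -p "$file" word/document.xml |
                        sed -e 's/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g; s/[^[:print:]\n]\{1,\}//g' \
                        > "${file%.docx}.txt"
                    ;;
                *.doc)
                    # antiword call from the question.
                    antiword -i 1 "$file" > "${file%.doc}.txt"
                    ;;
            esac
        done
}
```

After sourcing or pasting the function, convert_tree . would process the current directory and everything below it. Note that report.doc and report.docx in the same directory would both be written to report.txt, so one would overwrite the other; if that can happen in your tree, a different output suffix for one of the two formats avoids the clash.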