I would like to convert all .docx files in a directory (and its subdirectories) to text files from the command line, so that I can run grep on them afterwards. I found this
unzip -p tutu.docx word/document.xml | sed -e 's/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g; s/[^[:print:]\n]\{1,\}//g'
here, which works well, but it prints the text to the terminal. I would like to write the output to a new text file (.txt, for instance) in the same directory as the .docx file, and I would like a script that does this recursively.
I have this, using antiword, which does what I want for .doc files, but it doesn't work for .docx files.
find . -name '*.doc' | while read i; do antiword -i 1 "${i}" >"${i/doc/txt}"; done
I tried to mix both but without success... A command line that would do both at the same time would be appreciated!
Thank you
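The following script recursively finds the .docx files below the current directory (change the . in find . to your wished starting point) and writes a .txt file next to each .docx file.
Bash script: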
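# Find every .docx file below the current directory, one path per line.
find . -name "*.docx" | while IFS= read -r file; do
    # Print the main document XML, strip tags and non-printable characters,
    # and write the result next to the source file with a .txt extension.
    unzip -p "$file" word/document.xml |
    sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g' > "${file%.docx}.txt"
done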
Afterwards you can run grep like this:
grep -r "some text" --include "*.txt" .
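Since the question also asks for .doc and .docx in one go, here is a rough sketch that combines the antiword loop from the question with the unzip/sed approach. It assumes unzip, antiword and GNU sed are installed (the \n handling in the sed expression is a GNU extension). The function names xml_to_text and convert_tree and the script name doc2txt.sh are made up for this example. It keeps the paragraph-break substitution from the question's sed command, so each paragraph ends up on its own line.

```shell
#!/bin/sh
# Sketch: convert .doc and .docx files below a directory to .txt files.
# Assumes unzip, antiword and GNU sed are available; xml_to_text and
# convert_tree are hypothetical helper names, not part of any tool.

# Read the document XML on stdin and print plain text, one paragraph per line.
xml_to_text() {
    sed -e 's/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g; s/[^[:print:]\n]\{1,\}//g'
}

# Convert every .doc/.docx file below directory $1, writing NAME.txt beside it.
convert_tree() {
    find "$1" -type f \( -name '*.doc' -o -name '*.docx' \) |
    while IFS= read -r file; do
        case "$file" in
            # ${file%.docx} strips only the trailing extension, so a
            # directory with "docx" in its name is left untouched.
            *.docx) unzip -p "$file" word/document.xml </dev/null |
                        xml_to_text > "${file%.docx}.txt" ;;
            *.doc)  antiword -i 1 "$file" </dev/null > "${file%.doc}.txt" ;;
        esac
    done
}

# Run only when a starting directory is passed, e.g.: sh doc2txt.sh .
if [ "$#" -gt 0 ]; then
    convert_tree "$1"
fi
```

Note that report.doc and report.docx in the same directory would both be written to report.txt, and file names containing a newline are not handled by this loop.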