Search code examples
python-3.xtext-extraction.doc

Reading .doc files in python on windows 10


Note: This was flagged as a potential duplicate of this, but the point of my question is that using textract doesn't work. I am looking either for (a) a way to get textract to work on windows 10 or (b) an alternate solution.

I am building a system that needs to read various types of files. I have set up pdfminer to read the .pdfs, and based on the process outlined here I installed textract, and I can now also read .docx files. However textract relies on antiword for reading .doc files and I cannot get this to work, even after following the directions here I could not find and install a working version of antiword. I do not have microsoft word installed on my machine, and I am running windows 10 with python 3.6.5. Is there any other way to read .doc files?

Here is the bug when running textract.process('d.doc') (ignore the first error, the file is definitely there):

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\textract\parsers\utils.py", line 84, in run
    stdout=subprocess.PIPE, stderr=subprocess.PIPE,
  File "C:\ProgramData\Anaconda3\lib\subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "C:\ProgramData\Anaconda3\lib\subprocess.py", line 997, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\ProgramData\Anaconda3\lib\site-packages\textract\parsers\__init__.py", line 77, in process
    return parser.process(filename, encoding, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\textract\parsers\utils.py", line 46, in process
    byte_string = self.extract(filename, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\textract\parsers\doc_parser.py", line 9, in extract
    stdout, stderr = self.run(['antiword', filename])
  File "C:\ProgramData\Anaconda3\lib\site-packages\textract\parsers\utils.py", line 91, in run
    ' '.join(args), 127, '', '',
textract.exceptions.ShellError: The command antiword d.doc failed with exit code 127

Solution

  • I was able to get part of the text using olefile, but olefile ultimately only handles bytes and does not handle the encoding of Word .doc files. The solution is to use LibreOffice, see my other question here