I am trying to read a header from a word document using python-docx and watchdog. What I am doing is, whenever a new file is created or modified the script reads the file and get the contents in the header, but I am getting an
docx.opc.exceptions.PackageNotFoundError: Package not found at 'Test6.docx'
error and I tried everything including opening it as a stream but nothing has worked, and yes the document is populated. For reference, this is my code.
**main.py**
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler
import watchdog.observers
import watchdog.events
import os
import re
import xml.dom.minidom
import zipfile
from docx import Document
class Watcher:
DIRECTORY_TO_WATCH = "/path/to/my/directory"
def __init__(self):
self.observer = Observer()
def run(self):
event_handler = Handler()
self.observer.schedule(event_handler,path='C:/Users/abdsak11/OneDrive - Lärande', recursive=True)
self.observer.start()
try:
while True:
time.sleep(5)
except:
self.observer.stop()
print ("Error")
self.observer.join()
class Handler(FileSystemEventHandler):
@staticmethod
def on_any_event(event):
if event.is_directory:
return None
elif event.event_type == 'created':
# Take any action here when a file is first created.
path = event.src_path
extenstion = '.docx'
base = os.path.basename(path)
if extenstion in path:
print ("Received created event - %s." % event.src_path)
time.sleep(10)
print(base)
doc = Document(base)
print(doc)
section = doc.sections[0]
header = section.header
print (header)
elif event.event_type == 'modified':
# Taken any action here when a file is modified.
path = event.src_path
extenstion = '.docx'
base = os.path.basename(path)
if extenstion in base:
print ("Received modified event - %s." % event.src_path)
time.sleep(10)
print(base)
doc = Document(base)
print(doc)
section = doc.sections[0]
header = section.header
print (header)
if __name__ == '__main__':
w = Watcher()
w.run()
Edit: Tried to change the extension from doc to docx and that worked but is there anyway to open docx because thats what i am finding.
another thing. When opening the ".doc" file and trying to read the header all i am getting is
<docx.document.Document object at 0x03195488>
<docx.section._Header object at 0x0319C088>
and what i am trying to do is to extract the text from the header
You are trying to print the object itself, however you should access its property:
...
doc = Document(base)
section = doc.sections[0]
header = section.header
print(header.paragraphs[0].text)
according to https://python-docx.readthedocs.io/en/latest/user/hdrftr.html)
UPDATE
As I played with python-docx package, it turned out that PackageNotFoundError is very generic as it can occur simply because file is not accessible by some reason - not exist, not found or due to permissions, as well as if file is empty or corrupted. For example, in case of watchdog, it may very well happen that after triggering "created" event and before creating Document file can be renamed, deleted, etc. And for some reason you make this situation more probable by waiting 10 seconds before creating Document? So, try checking if file exists before:
if not os.path.exists(base):
raise OSError('{}: file does not exist!'.format(base))
doc = Document(base)
UPDATE2
Note also, that this may happen when opening program creates some lock file based on file name, e.g. running your code on linux and opening the file with libreoffice causes
PackageNotFoundError: Package not found at '.~lock.xxx.docx#'
because this file is not docx file! So you should update your filtering condition with
if path.endswith(extenstion):
...