Tags: python, ms-word, python-docx, python-watchdog

Reading a header from a Word .docx file with Python


I am trying to read the header of a Word document using python-docx and watchdog. Whenever a file is created or modified, the script reads the file and gets the contents of the header, but I am getting this error:

    docx.opc.exceptions.PackageNotFoundError: Package not found at 'Test6.docx'

I have tried everything, including opening the file as a stream, but nothing has worked, and yes, the document is populated. For reference, this is my code.

**main.py**
    import time
    from watchdog.observers import Observer
    from watchdog.events import FileSystemEventHandler
    import watchdog.observers
    import watchdog.events
    import os
    import re
    import xml.dom.minidom
    import zipfile
    from docx import Document


    class Watcher:
        DIRECTORY_TO_WATCH = "/path/to/my/directory"

        def __init__(self):
            self.observer = Observer()

        def run(self):
            event_handler = Handler()
            self.observer.schedule(event_handler,path='C:/Users/abdsak11/OneDrive - Lärande', recursive=True)
            self.observer.start()
            try:
                while True:
                    time.sleep(5)
            except:
                self.observer.stop()
                print ("Error")

            self.observer.join()


    class Handler(FileSystemEventHandler):

        @staticmethod
        def on_any_event(event):
            if event.is_directory:
                return None

            elif event.event_type == 'created':
                # Take any action here when a file is first created.
                path = event.src_path
                extenstion = '.docx'
                base = os.path.basename(path)

                if extenstion in path:
                    print ("Received created event - %s." % event.src_path)
                    time.sleep(10)
                    print(base)
                    doc = Document(base)
                    print(doc)
                    section = doc.sections[0]
                    header = section.header
                    print (header)



            elif event.event_type == 'modified':
                # Take any action here when a file is modified.
                path = event.src_path
                extenstion = '.docx'
                base = os.path.basename(path)
                if extenstion in base:
                    print ("Received modified event - %s." % event.src_path)
                    time.sleep(10)
                    print(base)
                    doc = Document(base)
                    print(doc)
                    section = doc.sections[0]
                    header = section.header
                    print (header)



    if __name__ == '__main__':
        w = Watcher()
        w.run()

Edit: I tried changing the extension from doc to docx and that worked, but is there any way to open docx files, since those are what I am finding?

Another thing: when I open the ".doc" file and try to read the header, all I get is

    <docx.document.Document object at 0x03195488>
    <docx.section._Header object at 0x0319C088>

whereas what I am trying to do is extract the text from the header.


Solution

  • You are printing the objects themselves; instead, read the text through the header's properties:

    ...
    doc = Document(base)
    section = doc.sections[0]
    header = section.header
    print(header.paragraphs[0].text)
    

    (according to https://python-docx.readthedocs.io/en/latest/user/hdrftr.html)
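A header can hold several paragraphs, and a document can have several sections, so reading only `paragraphs[0]` may miss text. A minimal sketch that collects every non-empty header paragraph follows. The helper name `header_text` is hypothetical; it relies only on the `sections`, `header`, `is_linked_to_previous`, `paragraphs`, and `text` attributes described in the python-docx documentation.

```python
def header_text(doc):
    """Collect the non-empty header paragraphs of a python-docx Document.

    Hypothetical helper, not part of python-docx itself.
    """
    lines = []
    for section in doc.sections:
        header = section.header
        # A linked header repeats the previous section's header,
        # so skip it to avoid collecting the same text twice.
        if header.is_linked_to_previous:
            continue
        for paragraph in header.paragraphs:
            if paragraph.text.strip():
                lines.append(paragraph.text)
    return lines
```

With python-docx installed, this could be called as `header_text(Document(path))`. Text that lives inside header tables is not covered by this sketch.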

    UPDATE

    As I played with the python-docx package, it turned out that PackageNotFoundError is very generic: it can occur simply because the file is not accessible for some reason (it does not exist, cannot be found, or cannot be read due to permissions), or because the file is empty or corrupted. With watchdog, for example, the file may well be renamed or deleted after the "created" event fires and before the Document is created. Waiting 10 seconds before creating the Document makes this situation more likely. So try checking whether the file exists first:

    if not os.path.exists(base):
        raise OSError('{}: file does not exist!'.format(base))
    doc = Document(base)
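One more thing worth checking in the question's code: `base` is only the file name, so `Document(base)` is resolved against the current working directory rather than the watched folder. Passing the full `event.src_path` avoids that. In addition, the fixed `time.sleep(10)` could be replaced by polling until the file is actually readable. The sketch below assumes a hypothetical helper `wait_for_docx`, with made-up defaults for the number of attempts and the delay. It uses `zipfile.is_zipfile` from the standard library because a .docx file is a zip package.

```python
import time
import zipfile


def wait_for_docx(path, attempts=5, delay=1.0):
    """Return True once `path` is a readable zip package, else False.

    Hypothetical helper: the name and the default values are assumptions.
    """
    for attempt in range(attempts):
        # is_zipfile() returns False for a missing, unreadable,
        # empty, or non-zip file instead of raising.
        if zipfile.is_zipfile(path):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

In the handler this might be used as `if wait_for_docx(event.src_path): doc = Document(event.src_path)`. A file that passes this check can still be an invalid Word document, so the call to `Document` may still need its own error handling.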
    

    UPDATE2

    Note also that this may happen when the program that opens the file creates a lock file based on the file name. For example, running your code on Linux and opening the file with LibreOffice causes

    PackageNotFoundError: Package not found at '.~lock.xxx.docx#'
    

    because this file is not a docx file! So you should update your filtering condition to

    if path.endswith(extenstion):
    ...
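The filtering condition could also be pulled into a small predicate so that the "created" and "modified" branches share it. This is only a sketch: `is_docx_candidate` is a hypothetical name, and the prefix list is an assumption to adapt. Besides the LibreOffice `.~lock.` files mentioned above, it skips the `~$` owner files that Microsoft Word places next to an open document, since those do end in `.docx`.

```python
import os


def is_docx_candidate(path):
    """Return True if `path` looks like a real .docx document.

    Hypothetical helper: the skipped prefixes are assumptions.
    """
    name = os.path.basename(path)
    # Lock and owner files are created next to an open document
    # and are not Word packages themselves.
    if name.startswith(".~lock.") or name.startswith("~$"):
        return False
    # Compare case-insensitively so that "REPORT.DOCX" also matches.
    return name.lower().endswith(".docx")
```

Checking the suffix rather than using `in` also rejects names such as `notes.docx.bak`, which the original substring test would accept.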