Search code examples
pythonpython-docx

jpeg/ png Image Insertion Error- python-docx


I am trying to copy images from one word document to the other. For that, I extracted all the images from the word document into a folder(img_folder) using the following code:

docx2txt.process('Sample2.docx', 'img_folder/')

Using the above code the images get saved into the img_folder in jpeg/png format However, when I try to insert the images from this folder I get the following error:

File "D:/Pycharm Projects/pydocx/newaddImg.py", line 60, in copy
    newp.add_run().add_picture(imgPath)
  File "D:\Pycharm Projects\pydocx\env\lib\site-packages\docx\text\run.py", line 62, in add_picture
    inline = self.part.new_pic_inline(image_path_or_stream, width, height)
  File "D:\Pycharm Projects\pydocx\env\lib\site-packages\docx\parts\story.py", line 56, in new_pic_inline
    rId, image = self.get_or_add_image(image_descriptor)
  File "D:\Pycharm Projects\pydocx\env\lib\site-packages\docx\parts\story.py", line 29, in get_or_add_image
    image_part = self._package.get_or_add_image_part(image_descriptor)
  File "D:\Pycharm Projects\pydocx\env\lib\site-packages\docx\package.py", line 31, in get_or_add_image_part
    return self.image_parts.get_or_add_image_part(image_descriptor)
  File "D:\Pycharm Projects\pydocx\env\lib\site-packages\docx\package.py", line 74, in get_or_add_image_part
    image = Image.from_file(image_descriptor)
  File "D:\Pycharm Projects\pydocx\env\lib\site-packages\docx\image\image.py", line 55, in from_file
    return cls._from_stream(stream, blob, filename)
  File "D:\Pycharm Projects\pydocx\env\lib\site-packages\docx\image\image.py", line 176, in _from_stream
    image_header = _ImageHeaderFactory(stream)
  File "D:\Pycharm Projects\pydocx\env\lib\site-packages\docx\image\image.py", line 199, in _ImageHeaderFactory
    raise UnrecognizedImageError
docx.image.exceptions.UnrecognizedImageError

Can anyone suggest how to resolve this error.


Solution

  • There are some images, perhaps taken on an older iPhone, that python-docx can't recognize. These images omit some parts of the standard image header and that prevents python-docx from recognizing them.

    The fix is to load the image into Pillow and then save it as the same type. This process restores the header and python-docx should then be able to read it without error.

    Something roughly like:

    from PIL import Image
    
    Image.open("corrupted-header.jpg")
    Image.save("fixed-header.jpg")
    

    You could do this in a try-except block and only process the files that raised UnrecognizedImageError, or you could pre-emptively pre-process all the image files.

    The relevant parts of the Pillow documentation are here:
    https://pillow.readthedocs.io/en/stable/reference/Image.html