I am trying to copy images from one word document to the other. For that, I extracted all the images from the word document into a folder(img_folder) using the following code:
docx2txt.process('Sample2.docx', 'img_folder/')
Using the above code the images get saved into the img_folder in jpeg/png format However, when I try to insert the images from this folder I get the following error:
File "D:/Pycharm Projects/pydocx/newaddImg.py", line 60, in copy
newp.add_run().add_picture(imgPath)
File "D:\Pycharm Projects\pydocx\env\lib\site-packages\docx\text\run.py", line 62, in add_picture
inline = self.part.new_pic_inline(image_path_or_stream, width, height)
File "D:\Pycharm Projects\pydocx\env\lib\site-packages\docx\parts\story.py", line 56, in new_pic_inline
rId, image = self.get_or_add_image(image_descriptor)
File "D:\Pycharm Projects\pydocx\env\lib\site-packages\docx\parts\story.py", line 29, in get_or_add_image
image_part = self._package.get_or_add_image_part(image_descriptor)
File "D:\Pycharm Projects\pydocx\env\lib\site-packages\docx\package.py", line 31, in get_or_add_image_part
return self.image_parts.get_or_add_image_part(image_descriptor)
File "D:\Pycharm Projects\pydocx\env\lib\site-packages\docx\package.py", line 74, in get_or_add_image_part
image = Image.from_file(image_descriptor)
File "D:\Pycharm Projects\pydocx\env\lib\site-packages\docx\image\image.py", line 55, in from_file
return cls._from_stream(stream, blob, filename)
File "D:\Pycharm Projects\pydocx\env\lib\site-packages\docx\image\image.py", line 176, in _from_stream
image_header = _ImageHeaderFactory(stream)
File "D:\Pycharm Projects\pydocx\env\lib\site-packages\docx\image\image.py", line 199, in _ImageHeaderFactory
raise UnrecognizedImageError
docx.image.exceptions.UnrecognizedImageError
Can anyone suggest how to resolve this error.
There are some images, perhaps taken on an older iPhone, that python-docx
can't recognize. These images omit some parts of the standard image header and that prevents python-docx
from recognizing them.
The fix is to load the image into Pillow and then save it as the same type. This process restores the header and python-docx
should then be able to read it without error.
Something roughly like:
from PIL import Image
Image.open("corrupted-header.jpg")
Image.save("fixed-header.jpg")
You could do this in a try-except block and only process the files that raised UnrecognizedImageError
, or you could pre-emptively pre-process all the image files.
The relevant parts of the Pillow documentation are here:
https://pillow.readthedocs.io/en/stable/reference/Image.html