I want to automatically extract images from a Word document. The images are Excel charts pasted as pictures (Enhanced Metafile) into the document.
After some quick research, I tried the following method:
import docx2txt as d2t

def extract_images_from_docx(path_to_file, images_folder, get_text=False):
    text = d2t.process(path_to_file, images_folder)
    if get_text:
        return text

path_to_file = './Report.docx'
images_folder = './Img/'
extract_images_from_docx(path_to_file, images_folder, False)
However, this method does NOT work. I am almost sure this is due to the format of the pictures: when I pasted a normal PNG image into a Word document, I was able to extract it with the code above.
I have also tried converting the document to PDF and extracting the images from there, with NO result:
from docx2pdf import convert

# One call is enough; the second argument sets the output path explicitly
convert('./Report.docx', './Report.pdf')
import fitz  # PyMuPDF

def get_pixmaps_in_pdf(pdf_filename):
    doc = fitz.open(pdf_filename)
    xrefs = set()
    for page_index in range(doc.page_count):
        for image in doc.get_page_images(page_index):
            xrefs.add(image[0])  # Add XREFs to set so duplicates are ignored
    pixmaps = [fitz.Pixmap(doc, xref) for xref in xrefs]
    doc.close()
    return pixmaps

def write_pixmaps_to_pngs(pixmaps):
    for i, pixmap in enumerate(pixmaps):
        pixmap.save(f'{i}.png')  # Might want to come up with a better name

pixmaps = get_pixmaps_in_pdf('./Report.pdf')
write_pixmaps_to_pngs(pixmaps)
So, does anyone know if there is a way to automatically extract Excel charts pasted as Enhanced Metafiles into a Word document?
Thank you in advance for your help!
The crazy thing is that .docx files are actually secretly .zip files. I've been able to successfully extract images from a .docx using the zipfile module. The images should live in the word/media directory of the extracted .zip. I dunno if the enhanced metafiles live there too, but that's my best guess. Here's something to get you started:
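import os
import zipfile

input_docx = 'NAME_OF_DOCX'  # placeholder: file name without the .docx extension

# A .docx is a ZIP archive, so it can be unpacked like one
with zipfile.ZipFile(f'{input_docx}.docx') as archive:
    archive.extractall('extracted_docx')

# Embedded pictures should end up under word/media
media_dir = os.path.join('extracted_docx', 'word', 'media')
for file in os.listdir(media_dir):
    if file.endswith('.emf'):
        # do something with the file
        pass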
(untested, but should work)
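If you only want the metafiles and not the whole unpacked archive, the same idea can be sketched as a small helper that reads the matching members straight out of the .zip. The function name extract_media and its extensions parameter are made up for this example, and it assumes the charts really are stored under word/media with an .emf (or .wmf) extension:

```python
import zipfile
from pathlib import Path, PurePosixPath


def extract_media(docx_path, out_dir, extensions=(".emf", ".wmf")):
    """Copy files from word/media/ with one of the given extensions into out_dir.

    Hypothetical helper for illustration; returns the list of written paths.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            member = PurePosixPath(name)  # ZIP member names always use "/"
            # Assumption: embedded pictures sit directly in word/media
            if member.parent != PurePosixPath("word/media"):
                continue
            if member.suffix.lower() not in extensions:
                continue
            target = out_dir / member.name
            target.write_bytes(archive.read(name))
            written.append(target)
    return written
```

With the paths from the question, that would be called as extract_media('./Report.docx', './Img/'). Keep in mind that EMF is a vector format, so what you get back are .emf files; turning them into PNGs would be a separate conversion step with whatever tool you have available.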