Tags: python, docx, python-docx

Remove all images from docx files


I've searched the documentation for python-docx and other packages, as well as Stack Overflow, but could not find how to remove all images from docx files with Python.

My exact use case: I need to convert hundreds of Word documents to a "draft" format to be viewed by clients. Those drafts should be identical to the original documents, except that all images must be deleted or redacted.

Sorry for not including an example of what I tried; it amounts to hours of research that turned up nothing. I found this question on how to extract images from Word files, but that doesn't delete them from the actual document: Extract pictures from Word and Excel with Python

From there and other sources I learned that docx files can be read as plain zip files. I don't know whether it's possible to "re-zip" a file without its images without affecting the integrity of the docx file, but I thought this might be a path to a solution. (Edit: simply deleting the images works, but prevents python-docx from continuing to work with the file because of the dangling references to the missing images.)
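To make the zip idea concrete, here is a minimal sketch that lists the image files in a docx and the relationship entries that point at them (the references that break when the files are simply deleted). It assumes the usual layout with images under `word/media/` and the main document's relationships in `word/_rels/document.xml.rels`; headers and footers keep their own `.rels` files, which this sketch ignores. The function name `list_images` is made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace and relationship type used by (transitional) OOXML packages.
REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"
IMAGE_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
)


def list_images(docx):
    """Return (media_files, image_rels) for a .docx path or file object.

    Hypothetical helper: only inspects the main document part.
    """
    with zipfile.ZipFile(docx) as zf:
        # Image payloads normally live under word/media/.
        media = [n for n in zf.namelist() if n.startswith("word/media/")]
        rels_xml = zf.read("word/_rels/document.xml.rels")

    # Map relationship id -> target for every image relationship.
    root = ET.fromstring(rels_xml)
    rels = {
        rel.get("Id"): rel.get("Target")
        for rel in root.iter(REL_NS + "Relationship")
        if rel.get("Type") == IMAGE_TYPE
    }
    return media, rels
```

Calling `list_images("report.docx")` (with a file name of your own) returns the media file names and a dict of relationship ids to targets, which shows what a tool such as python-docx expects to find when it opens the file.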

Any ideas?


Solution

  • If your goal is to redact images, maybe this code I used for a similar use case could be useful:

    import io
    import os
    import zipfile

    from PIL import Image, ImageFilter

    blur = ImageFilter.GaussianBlur(40)

    # Map file extensions to the format names Pillow expects
    # ("jpg" on its own is not a valid format name for Image.save).
    FORMATS = {".png": "PNG", ".jpg": "JPEG", ".jpeg": "JPEG", ".gif": "GIF"}

    def redact_images(filename):
        outfile = filename.replace(".docx", "_redacted.docx")
        with zipfile.ZipFile(filename) as inzip, zipfile.ZipFile(outfile, "w") as outzip:
            for info in inzip.infolist():
                content = inzip.read(info)
                ext = os.path.splitext(info.filename)[1].lower()
                if ext in FORMATS:
                    # Blur the image and re-encode it in its original format.
                    img = Image.open(io.BytesIO(content))
                    img = img.convert().filter(blur)
                    outb = io.BytesIO()
                    img.save(outb, FORMATS[ext])
                    content = outb.getvalue()
                # writestr() recomputes the size and CRC for the new content.
                outzip.writestr(info, content)
    

    Here I used PIL to blur the images in some files, but any other suitable operation could be used instead of the blur filter. This worked quite nicely for my use case.
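As a sketch of what "any other suitable operation" might look like, the copy loop can take the operation as a parameter. The names `rewrite_media` and `blank_like` are made up for illustration. It assumes images live under `word/media/`, and that replacing an image with a plain white one of the same size and format is acceptable for a draft. Because every file keeps its name, the references inside the document stay valid.

```python
import io
import zipfile


def rewrite_media(src, dst, transform):
    """Copy the docx at src to dst, passing each media file through transform.

    Hypothetical helper: transform(name, data) must return the new bytes.
    src and dst must be different files (or file objects).
    """
    with zipfile.ZipFile(src) as inzip, zipfile.ZipFile(dst, "w") as outzip:
        for info in inzip.infolist():
            data = inzip.read(info)
            if info.filename.startswith("word/media/"):
                data = transform(info.filename, data)
            # Reusing info keeps the name and compression type of the entry.
            outzip.writestr(info, data)


def blank_like(name, data):
    """Example transform: a white image with the same size and format.

    Needs Pillow; anything it cannot read (or is not PNG/JPEG/GIF)
    is returned unchanged.
    """
    from PIL import Image  # imported here so rewrite_media works without Pillow

    try:
        img = Image.open(io.BytesIO(data))
    except OSError:
        return data
    if img.format not in {"PNG", "JPEG", "GIF"}:
        return data
    blank = Image.new("RGB", img.size, "white")
    out = io.BytesIO()
    blank.save(out, img.format)
    return out.getvalue()
```

Usage would be something like `rewrite_media("report.docx", "report_draft.docx", blank_like)`, with file names of your own. Vector formats such as EMF or WMF pass through untouched in this sketch, so check the output if your documents contain them.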