
Removing personal information from the comments in a Word file using Python


I want to remove all personal information from the comments inside a Word file.

Removing the author's name is fine; I did that with the following:

from docx import Document

document = Document('sampleFile.docx')
core_properties = document.core_properties
core_properties.author = ""
document.save('new-filename.docx')

But this is not what I need. I want to remove the name of every person who commented inside that Word file.

The way we do it manually is by going to Preferences -> Security -> Remove personal information from this file on save.


Solution

  • To remove personal information from the comments in a .docx file, you have to work on the internals of the file itself.

    A .docx file is just a .zip archive containing Word-specific files. We need to overwrite some of those internal files, and the easiest way I could find is to read all of them into memory, change whatever has to change, and write everything to a new archive.
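
    Before changing anything, it can help to check which internal files actually contain author attributes. The following is a minimal sketch, not part of the original answer: the helper name `parts_with_author` is hypothetical, and the sample archive is a tiny in-memory stand-in (not a valid Word document) so the snippet runs without a real file.

```python
import io
import re
from zipfile import ZipFile

# Matches the author attribute that Word writes on comments.
AUTHOR_ATTR = re.compile(rb'w:author="([^"]*)"')


def parts_with_author(docx):
    """Map each internal file name to the set of author names found in it."""
    found = {}
    with ZipFile(docx, 'r') as archive:
        for name in archive.namelist():
            matches = AUTHOR_ATTR.findall(archive.read(name))
            if matches:
                found[name] = {m.decode('utf-8') for m in matches}
    return found


# Tiny stand-in for a .docx archive, used only for this demo.
sample = io.BytesIO()
with ZipFile(sample, 'w') as archive:
    archive.writestr('word/document.xml', b'<w:document/>')
    archive.writestr(
        'word/comments.xml',
        b'<w:comments><w:comment w:id="0" w:author="Jane Doe"/></w:comments>',
    )
sample.seek(0)

print(parts_with_author(sample))
```

    With a real document, pass its path instead of `sample`. The full script that rewrites the comments follows.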

    import os
    import re
    from zipfile import ZipFile, ZIP_DEFLATED
    
    docx_file_name = '/path/to/your/document.docx'
    
    files = dict()
    
    # Read every internal file as raw bytes and store it in the "files" dictionary.
    # Working on bytes (not decoded text) keeps binary parts such as images intact.
    with ZipFile(docx_file_name, 'r') as document_as_zip:
        for internal_file in document_as_zip.infolist():
            files[internal_file.filename] = document_as_zip.read(internal_file.filename)
    
    # If there are any comments.
    if "word/comments.xml" in files:
        # Change every comment author to "Unknown Author".
        files["word/comments.xml"] = re.sub(
            rb'w:author="[^"]*"',
            b'w:author="Unknown Author"',
            files["word/comments.xml"],
        )
    
    # Write everything to a temporary archive first,
    # so the original file survives if something fails halfway.
    temp_file_name = docx_file_name + '.tmp'
    with ZipFile(temp_file_name, 'w', ZIP_DEFLATED) as document_as_zip:
        for internal_file_name, data in files.items():
            document_as_zip.writestr(internal_file_name, data)
    
    # Replace the old .docx file with the new one.
    os.replace(temp_file_name, docx_file_name)
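
    Word also stores each commenter's initials in a `w:initials` attribute next to `w:author`, so the name can still leak through that field. The sketch below wraps the same approach in a function that masks both attributes. It is an illustration under assumptions, not the original answer's code: the function name `anonymize_comments` is hypothetical, it only touches `word/comments.xml`, and the demo uses an in-memory stand-in archive rather than a valid Word document.

```python
import io
import re
from zipfile import ZipFile, ZIP_DEFLATED


def anonymize_comments(source, target):
    """Copy the archive from source to target, masking comment authors and initials.

    source and target must be different files (or file-like objects).
    """
    with ZipFile(source, 'r') as src, ZipFile(target, 'w', ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == 'word/comments.xml':
                data = re.sub(rb'w:author="[^"]*"', b'w:author="Unknown Author"', data)
                data = re.sub(rb'w:initials="[^"]*"', b'w:initials=""', data)
            dst.writestr(item.filename, data)


# Stand-in archive for the demo: one comment with an author and initials.
source = io.BytesIO()
with ZipFile(source, 'w') as archive:
    archive.writestr('word/document.xml', b'<w:document/>')
    archive.writestr(
        'word/comments.xml',
        b'<w:comments><w:comment w:id="0" w:author="Jane Doe" w:initials="JD"/></w:comments>',
    )
source.seek(0)

target = io.BytesIO()
anonymize_comments(source, target)

with ZipFile(target, 'r') as archive:
    cleaned = archive.read('word/comments.xml')
print(cleaned.decode('utf-8'))
```

    For a real file, pass two different paths, for example `anonymize_comments('in.docx', 'out.docx')`, where both file names are placeholders.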