python, xml, lxml, xml-error

Identify broken XML files inside a zipped archive


I am trying to read a large number of zipped files (.zip or .docx) in a loop, each of which contains a large number of embedded XML (.xml) files. However, some of the embedded XML files are broken/corrupted. I can create a parser that ignores the errors and loads the XML contents, but I want to know which XML file is corrupted and which particular element inside it is failing. I have tried the code below:

import re
import os
import zipfile
from lxml import etree

for file in os.listdir(filepath):
    if file.endswith('.zip') or file.endswith('.docx'):
        ext = os.path.splitext(file)[1]
        newfile = f"{os.path.splitext(os.path.basename(file))[0]}_new{ext}"

        zippedin = zipfile.ZipFile(os.path.join(filepath, file), 'r')
        recovering_parser = etree.XMLParser(recover=True)
        matched_items = []

        for item in zippedin.infolist():
            xmltree = etree.fromstring(zippedin.read(item.filename), parser=recovering_parser)

            for node in xmltree.iter(tag=etree.Element):
                if re.search('XXXXXXX', str(node)) or re.search('YYYYYYYY', str(node.attrib)):
                    matched_items.append(item)

        with zipfile.ZipFile(os.path.join(filepath, newfile), 'w') as zippedout:
            for element in matched_items:
                zippedout.writestr(element, zippedin.read(element.filename))
        zippedin.close()

This code snippet works fine and bypasses the broken XML files inside the zipped archives. However, I need to identify which files are failing, and also the individual components that fail. If I remove the recovering_parser, I get error messages like the following:

lxml.etree.XMLSyntaxError: xmlns: 'ABCDEFGHXXXX' is not a valid URI, line 5, column 45

It does not show which XML file is corrupted. Can someone help me identify the broken XML files, and suggest a proper way to handle the exception and extract the name of the faulty component?


Solution

  • Parsing with recovering_parser = etree.XMLParser(recover=True) suppresses the parse errors, so you cannot tell which files are broken. To catch those errors, parse with the default (strict) parser inside a try/except block.

    import re
    import os
    import zipfile
    from lxml import etree
    
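    # This goes inside the question's loop over zippedin.infolist()
    try:
        # Default (strict) parser, so broken XML raises instead of being repaired
        xmltree = etree.fromstring(zippedin.read(item.filename))
    except etree.XMLSyntaxError as e:
        # Debugging code here
        print(file, item.filename)  # archive name and embedded XML file name
        print(e)                    # parser message, including line and column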
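
A slightly fuller sketch of the same idea, for a single archive. It is not from the original answer: the function name find_broken_xml and the suffix filter (.xml and .rels) are assumptions made for illustration. Each member is parsed with lxml's default strict parser, and for every failure the member name, the line and column reported by XMLSyntaxError, and the parser message are recorded.

```python
import io
import zipfile

from lxml import etree


def find_broken_xml(archive):
    """Return (member_name, line, column, message) for each XML member
    of `archive` that fails strict parsing.

    `archive` may be a path or a file-like object, as accepted by
    zipfile.ZipFile. The function name and the suffix filter below are
    illustrative assumptions, not part of the original answer.
    """
    broken = []
    with zipfile.ZipFile(archive) as zipped:
        for item in zipped.infolist():
            # Assumption: only members with these suffixes contain XML.
            if not item.filename.endswith(('.xml', '.rels')):
                continue
            try:
                # No recover=True here, so malformed XML raises.
                etree.fromstring(zipped.read(item.filename))
            except etree.XMLSyntaxError as exc:
                # position is the (line, column) of the reported error.
                line, column = exc.position
                broken.append((item.filename, line, column, exc.msg))
    return broken
```

Calling find_broken_xml(os.path.join(filepath, file)) inside the question's outer loop would give one list per archive. Note that a strict parse stops at the first error in each member, so at most one entry per embedded file is reported.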