I am trying to read a large number of zipped files (.zip or .docx) in a loop, each of which contains a large number of embedded XML (.xml) files. However, some of the embedded XML files are broken or corrupted. I can create a parser that ignores the errors and loads the XML contents, but I want to know which XML file is corrupted and which particular element inside it is failing. I have tried the code below:
import re
import os
import zipfile
from lxml import etree

for file in os.listdir(filepath):
    if file.endswith('.zip') or file.endswith('.docx'):
        ext = os.path.splitext(file)[1]
        newfile = f"{os.path.splitext(os.path.basename(file))[0]}_new{ext}"
        zippedin = zipfile.ZipFile(os.path.join(filepath, file), 'r')
        recovering_parser = etree.XMLParser(recover=True)
        matched_items = []
        for item in zippedin.infolist():
            xmltree = etree.fromstring(zippedin.read(item.filename), parser=recovering_parser)
            for node in xmltree.iter(tag=etree.Element):
                if re.search('XXXXXXX', str(node)) or re.search('YYYYYYYY', str(node.attrib)):
                    matched_items.append(item)
        with zipfile.ZipFile(os.path.join(filepath, newfile), 'w') as zippedout:
            for element in matched_items:
                zippedout.writestr(element, zippedin.read(element.filename))
        zippedin.close()
This code snippet works fine and bypasses the broken XML files inside the zipped archives. However, I need to identify which files are failing, and also the individual components that fail. If I remove the recovering_parser portion, I receive error messages like the following:
lxml.etree.XMLSyntaxError: xmlns: 'ABCDEFGHXXXX' is not a valid URI, line 5, column 45
It does not show which XML file is corrupted. Can someone help me identify the broken XML files, and suggest a proper way to handle the exception and extract the name of the faulty component?
Using recovering_parser = etree.XMLParser(recover=True) is what prevents you from seeing which files are broken, because the parser quietly works around the errors instead of raising. To catch those errors, parse without the recovering parser inside a try/except block:
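import os
import zipfile
from lxml import etree

# filepath is the directory from the question
for file in os.listdir(filepath):
    if not file.endswith(('.zip', '.docx')):
        continue
    with zipfile.ZipFile(os.path.join(filepath, file)) as zippedin:
        for item in zippedin.infolist():
            try:
                # Default (non-recovering) parser, so malformed XML raises
                etree.fromstring(zippedin.read(item.filename))
            except etree.XMLSyntaxError as e:
                # Print the archive name, the embedded file name, and the parser message
                print(file, item.filename, e)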
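If you also want the location of the failure rather than just the file name, here is a minimal sketch of one way to collect it. It assumes that only archive members ending in .xml should be checked; the function name find_broken_xml and the in-memory demo archive are made up for illustration and are not part of the original code. It relies on the position attribute of lxml's XMLSyntaxError, which holds the line and column where the parser stopped.

```python
import io
import zipfile

from lxml import etree


def find_broken_xml(archive):
    """Return (member, line, column, message) for each .xml member that fails strict parsing.

    `archive` may be a path or a file-like object (hypothetical helper).
    """
    failures = []
    with zipfile.ZipFile(archive) as zipped:
        for item in zipped.infolist():
            # Assumption: only members named *.xml are of interest
            if not item.filename.endswith('.xml'):
                continue
            try:
                # Default parser (no recover=True), so malformed XML raises
                etree.fromstring(zipped.read(item.filename))
            except etree.XMLSyntaxError as exc:
                line, column = exc.position
                failures.append((item.filename, line, column, str(exc)))
    return failures


# Demo on an in-memory archive with one well-formed and one broken member
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as demo:
    demo.writestr('good.xml', '<root><a/></root>')
    demo.writestr('bad.xml', '<root><a></root>')

for member, line, column, message in find_broken_xml(buffer):
    print(f"{member}: line {line}, column {column}: {message}")
```

The line and column point at the place where the parser gave up, which is usually on or just after the faulty element. A strict parse stops at the first error in each file; if you need every problem in a file while still loading it, you could keep the recovering parser and inspect its error_log after each parse instead.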