I wrote a script reading XML files using minidom:
    from xml.dom.minidom import parse

    for File in Data['FileList']:
        Xml = parse(File)
        # do something
This runs fine, but some people create XML files that declare UTF-8 encoding in the XML header and then use German umlauts in tags, so I run into xml.parsers.expat.ExpatError: not well-formed (invalid token).
If I manually change the declaration in the XML to encoding="ISO-8859-1", it runs fine.
Is there a more elegant way of changing the encoding, instead of editing the XML files, e.g. telling minidom to use a different encoding than defined in the XML?
I suggest this solution:
Before parsing the file, open it normally and replace its first line, which holds the XML header, with this line:
<?xml version="1.0" encoding="ISO-8859-1"?>
You then save the file and pass it to the minidom.parse() function.
This may help you replace the first line in each file: Search and replace a line in a file in Python
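If you would rather not modify the files on disk, a variation on the same idea is to patch the declaration in memory and hand the result to minidom.parseString(), which accepts bytes. The following is only a minimal sketch: the helper name parse_with_encoding and the regular expression are my own assumptions, not part of minidom, and it assumes the files really are ISO-8859-1 and start with an XML declaration that carries an encoding attribute. A file without such a declaration is parsed unchanged.

```python
import re
from xml.dom.minidom import parseString

# Hypothetical pattern: captures everything up to the encoding value inside a
# leading XML declaration, plus the quote character that surrounds the value.
_DECL_ENCODING = re.compile(rb'^(\s*<\?xml[^>]*?encoding\s*=\s*)(["\'])[^"\']*\2')


def parse_with_encoding(path, encoding="ISO-8859-1"):
    """Parse an XML file as if its declaration named `encoding`.

    Hypothetical helper: the file on disk is left untouched; only the
    in-memory copy of the declaration is rewritten before parsing.
    """
    with open(path, "rb") as handle:
        raw = handle.read()

    # Swap only the declared encoding name; the document bytes stay as they are.
    patched = _DECL_ENCODING.sub(
        lambda m: m.group(1) + m.group(2) + encoding.encode("ascii") + m.group(2),
        raw,
        count=1,
    )
    return parseString(patched)
```

In the loop from the question, Xml = parse(File) would then become Xml = parse_with_encoding(File).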