from urllib.request import urlopen
from lxml import objectify
I am trying to write a program that will download XML files into a cache and then open them using objectify. If I download the files using urlopen(), then I can read them in using objectify.fromstring() just fine:
r = urlopen(my_url)
o = objectify.fromstring(r.read())
However, if I download them and write them to a file, I end up with an encoding declaration at the top of the file that objectify doesn't like. To wit:
# download the file
my_file = 'foo.xml'
r = urlopen(my_url)
# save locally
with open(my_file, 'wb') as fp:
    fp.write(r.read())
# open saved copy
with open(my_file, 'r') as fp:
    o1 = objectify.fromstring(fp.read())
This results in ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
If I use objectify.parse(fp) instead, that works fine, so I could go through and change all the client code to use parse(), but I feel like that is not the right approach. I have other XML files stored locally for which .fromstring() works just fine; based on a cursory review, they appear to have utf-8 encoding.
I just don't know what the right resolution is here. Should I change the encoding when I save the file? Should I strip the encoding declaration? Should I fill my code with try.. except ValueError clauses? Please advise.
The file needs to be opened in binary mode rather than text mode, so that fp.read() returns bytes instead of str:
open(my_file, 'rb') # b stands for binary
This is what the exception suggests: ... Please use bytes input ...
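To make the difference concrete, here is a minimal sketch that reproduces both cases without a network call. The sample document in xml_bytes and the temporary directory are assumptions standing in for the downloaded file; only the open() modes and objectify.fromstring() come from the question.

```python
import os
import tempfile

from lxml import objectify

# Hypothetical sample document standing in for the downloaded XML.
xml_bytes = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<root><price>42</price></root>'
)

with tempfile.TemporaryDirectory() as tmp:
    my_file = os.path.join(tmp, 'foo.xml')

    # save locally, as in the question
    with open(my_file, 'wb') as fp:
        fp.write(xml_bytes)

    # text mode: fp.read() returns str, and lxml rejects a str
    # that carries an encoding declaration
    with open(my_file, 'r', encoding='utf-8') as fp:
        try:
            objectify.fromstring(fp.read())
            text_mode_failed = False
        except ValueError:
            text_mode_failed = True

    # binary mode: fp.read() returns bytes, and lxml decodes them
    # according to the declaration in the file
    with open(my_file, 'rb') as fp:
        o1 = objectify.fromstring(fp.read())
```

Reading in binary mode keeps the saved-file path consistent with the urlopen() path, since r.read() there also returns bytes. That would also explain why no change to the client code or to the encoding declaration should be needed.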