I'm having a go at fixing a broken library on GitHub that I want to use.
I have locally "fixed" the problem, but I don't think it's a very clean method...
I'm poking at the WARC library by the Internet Archive, specifically the arc.py part (https://github.com/internetarchive/warc/blob/master/warc/arc.py).
Since the library was written, the tools that produce ARC files have changed a bit. As a result, the built-in parser fails because it's not expecting to see some metadata in the file.
My local fix looks like this:
if header.startswith("<arcmetadata"):
    while not header.endswith("</arcmetadata>\n"):
        header = self.fileobj.readline()
    header = self.fileobj.readline()
    header = self.fileobj.readline()
And I'm not sure that calling readline() twice to skip the next two empty lines (each containing just "\n") is the cleanest way of advancing through the file object.
Is this good Python? Or is there a better way?
As written, the repeated calls look like a copy/paste error. There is nothing wrong with using .readline(); just document what you are doing:
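# skip metadata
if header.startswith("<arcmetadata"):
    while not header.endswith("</arcmetadata>\n"):
        header = self.fileobj.readline()
    # NOTE: header ends with `"</arc..."` here, i.e., it is not blank,
    # so read the next line before checking for blank lines
    header = self.fileobj.readline()
    # skip blank lines (readline() returns "" at EOF, which ends the loop)
    while header and not header.strip():
        header = self.fileobj.readline()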
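To try the skipping logic outside the library, here is a minimal standalone sketch. The function name `skip_arc_metadata` and the sample data are made up for illustration and are not part of the warc library. It assumes the file is read in text mode, and it moves past the closing-tag line before looking for blank lines, since that line itself is not blank.

```python
import io


def skip_arc_metadata(fileobj, header):
    """Return the first non-blank line after an <arcmetadata> block.

    `header` is the line already read from `fileobj`. If it does not start
    a metadata block, it is returned unchanged.
    """
    if not header.startswith("<arcmetadata"):
        return header
    # Consume lines up to and including the one with the closing tag.
    while header and not header.endswith("</arcmetadata>\n"):
        header = fileobj.readline()
    # Move past the closing-tag line, then skip any blank lines.
    # readline() returns "" at end of file, which stops the loop.
    header = fileobj.readline()
    while header and not header.strip():
        header = fileobj.readline()
    return header


# Made-up sample data, only loosely shaped like an ARC file.
sample = io.StringIO(
    "<arcmetadata>\n"
    "<field>value</field>\n"
    "</arcmetadata>\n"
    "\n"
    "\n"
    "http://example.com/ 127.0.0.1 20120101000000 text/html 10\n"
)
next_header = skip_arc_metadata(sample, sample.readline())
print(next_header)
```

Looping on blank lines instead of calling readline() a fixed number of times means the code keeps working if a future tool writes one or three blank lines instead of two.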
By the way, if the file contains XML, use an XML parser to parse it rather than doing it by hand.
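As a rough sketch of that suggestion, the metadata block could be collected and handed to the standard library's xml.etree.ElementTree instead of being thrown away. The helper name `read_arc_metadata` and the `<software>` element are hypothetical. Real ARC metadata may use XML namespaces, in which case the tag names would be namespace-qualified.

```python
import io
import xml.etree.ElementTree as ET


def read_arc_metadata(fileobj, header):
    """Collect an <arcmetadata> block starting at `header` and parse it."""
    lines = [header]
    # Keep reading until the line that carries the closing tag (or EOF).
    while header and not header.endswith("</arcmetadata>\n"):
        header = fileobj.readline()
        lines.append(header)
    # fromstring() raises ParseError if the block is not well-formed XML.
    return ET.fromstring("".join(lines))


# Made-up sample block; the element names are illustrative only.
sample = io.StringIO(
    "<arcmetadata>\n"
    "<software>example-crawler</software>\n"
    "</arcmetadata>\n"
)
root = read_arc_metadata(sample, sample.readline())
print(root.tag)
print(root.findtext("software"))
```

This only delimits the block by hand. The content inside it is left to the parser, which also reports malformed metadata instead of silently skipping it.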