
Python - Handling an nth line hop with readlines()


I'm having a go at fixing a broken lib on GitHub that I want to use.

I have locally "fixed" the problem, but I don't think it's a very clean method...

I'm poking at the WARC library by the Internet Archive, and specifically the arc.py part (https://github.com/internetarchive/warc/blob/master/warc/arc.py).

Since the lib was written, the tools that make the ARC files have changed a bit, and as a result the built-in parser fails, as it's not expecting to see some metadata in the file.

My local fix looks like this:

    if header.startswith("<arcmetadata"):
        while not header.endswith("</arcmetadata>\n"):
            header = self.fileobj.readline()
        header = self.fileobj.readline()
        header = self.fileobj.readline()

I'm not sure that calling readline() twice to strip off the next two empty lines (each containing just "\n") is the cleanest way of advancing through the file object.

Is this good Python, or is there a better way?


Solution

  • The code looks like a copy/paste error. There is nothing wrong with using .readline(), just document what you are doing:

    # skip metadata
    if header.startswith("<arcmetadata"):
        while not header.endswith("</arcmetadata>\n"):
            header = self.fileobj.readline()
        # NOTE: header still holds the "</arcmetadata>" line here, i.e. it is
        # not blank, so read past it before skipping blank lines
        header = self.fileobj.readline()
    
    # skip blank lines
    while not header.strip():
        header = self.fileobj.readline()
    

    By the way, if the file contains XML, use an XML parser to parse it. Don't do it by hand.
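
To make the two suggestions above concrete, here is a minimal, self-contained sketch. It is not the library's code: `skip_metadata` is a hypothetical helper, the sample data is invented, and it assumes a text-mode file object and a metadata block that is well-formed XML with the opening and closing tags each ending their line. It also stops at end of file, where `readline()` returns an empty string, so a truncated file cannot cause an endless loop.

```python
import io
import xml.etree.ElementTree as ET


def skip_metadata(fileobj, header):
    """Return (metadata_text, next_header).

    Hypothetical helper: `header` is the line already read from `fileobj`.
    """
    metadata_lines = []
    if header.startswith("<arcmetadata"):
        metadata_lines.append(header)
        # Collect lines up to and including the closing tag.
        # An empty string from readline() means EOF, so stop there too.
        while header and not header.rstrip().endswith("</arcmetadata>"):
            header = fileobj.readline()
            metadata_lines.append(header)
        header = fileobj.readline()  # move past the closing-tag line

    # Skip blank lines: "\n" is blank, "" (EOF) ends the loop.
    while header and not header.strip():
        header = fileobj.readline()

    return "".join(metadata_lines), header


# Invented sample: a metadata block, two blank lines, then a record header.
sample = io.StringIO(
    "<arcmetadata>\n"
    "<software>example-crawler</software>\n"
    "</arcmetadata>\n"
    "\n"
    "\n"
    "http://example.com/ 127.0.0.1 20240101000000 text/html 10\n"
)
metadata, header = skip_metadata(sample, sample.readline())

# Let an XML parser handle the block instead of picking it apart by hand.
root = ET.fromstring(metadata)
print(root.find("software").text)
print(header, end="")
```

Returning the collected block rather than discarding it means the caller can still ignore the metadata, while the XML parser is available if the fields are ever needed. Real metadata blocks may carry namespaces and attributes that this invented sample does not.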