Search code examples
pythonfor-loopreadlinescontrol-characters

Why does '\x01\x1A' (Start-of-Header and Substitute control characters) in a textfile line stop a for-loop prematurely?


I'm using Python 2.7.15, Windows 7

Context

I wrote a script to read and tokenize each line of a FileZilla log file (specifications here) for the IP address of the host that initiated the connection to the FileZilla server. I'm having trouble parsing the log text field that follows the > character. The script I wrote uses the:

    with open('fz.log','r') as rh:
       for lineno, line in rh: 
          pass

construct to read each line. That for-loop stopped prematurely when it encountered a log text field that contained the SOH and SUB characters. I can't show you the log file since it contains sensitive information but the crux of the problem can be reproduced by reading a textfile that contains those characters on a line.

My goal is to extract the IP addresses (which I can do using re.search()) but before that happens, I have to remove those control characters. I do this by creating a copy of the log file where the lines containing those control characters are removed. There's probably a better way, but I'm more curious why the for-loop just stops after encountering the control characters.

Reproducing the Issue

I reproduced the problem with this code:

if __name__ == '__main__':
    fn = 'writetest.txt'
    fn2 = 'writetest_NoControlChars.txt'

    # Create the problematic textfile
    with open(fn, 'w') as wh: 
        wh.write("This line comes first!\n");
        wh.write("Blah\x01\x1A\n"); # Write Start-of-Header and Subsitute unicode character to line
        wh.write("This comes after!")

    # Try to read the file above, removing the SOH/SUB characters if encountered
    with open(fn, 'r') as rh:
        with open(fn2, 'w') as wh:
            for lineno, line in enumerate(rh):
                sline = line.translate(None,'\x01\x1A')
                wh.write(sline)
                print "Line #{}: {}".format(lineno, sline)
    print "Program executed."

Output

The code above creates 2 output files and produces the following in a console window:

Line #0: This line comes first!

Line #1: Blah
Program executed.

I step-debugged through the code in Eclipse and immediately after executing the

for lineno, line in enumerate(rh): 

statement, rh, the handle for that opened file was closed. I had expected it to move onto the third line, printing out This comes after! to console and writing it out to writetest_NoControlChars.txt but neither events happened. Instead, execution jumped to print "Program executed". Picture of Local Variable values in Debug Console


Solution

  • You have to open this file in binary mode if you know it contains non-text data: open(fn, 'rb')