I'm using Python 2.7.15, Windows 7
Context
I wrote a script to read and tokenize each line of a FileZilla log file (specifications here) for the IP address of the host that initiated the connection to the FileZilla server. I'm having trouble parsing the log text field that follows the > character. The script I wrote uses the following construct to read each line:
    with open('fz.log', 'r') as rh:
        for lineno, line in enumerate(rh):
            pass
That for-loop stopped prematurely when it encountered a log text field that contained the SOH and SUB characters. I can't show you the log file since it contains sensitive information, but the crux of the problem can be reproduced by reading a text file that contains those characters on a line.
My goal is to extract the IP addresses (which I can do using re.search()), but before that happens, I have to remove those control characters. I do this by creating a copy of the log file where the lines containing those control characters are removed. There's probably a better way, but I'm more curious why the for-loop just stops after encountering the control characters.
Reproducing the Issue
I reproduced the problem with this code:
    if __name__ == '__main__':
        fn = 'writetest.txt'
        fn2 = 'writetest_NoControlChars.txt'

        # Create the problematic text file
        with open(fn, 'w') as wh:
            wh.write("This line comes first!\n")
            wh.write("Blah\x01\x1A\n")  # Write the Start-of-Header (SOH) and Substitute (SUB) control characters to the line
            wh.write("This comes after!")

        # Try to read the file above, removing the SOH/SUB characters if encountered
        with open(fn, 'r') as rh:
            with open(fn2, 'w') as wh:
                for lineno, line in enumerate(rh):
                    sline = line.translate(None, '\x01\x1A')
                    wh.write(sline)
                    print "Line #{}: {}".format(lineno, sline)
        print "Program executed."
Output
The code above creates 2 output files and produces the following in a console window:
Line #0: This line comes first!
Line #1: Blah
Program executed.
I step-debugged through the code in Eclipse and immediately after executing the for lineno, line in enumerate(rh): statement, rh, the handle for that opened file, was closed. I had expected it to move on to the third line, printing This comes after! to the console and writing it out to writetest_NoControlChars.txt, but neither event happened. Instead, execution jumped to print "Program executed."
[Image: local variable values in the debug console]
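If you know the file contains non-text data, you have to open it in binary mode: open(fn, 'rb'). In text mode on Windows, Python 2 treats the SUB byte (0x1A, Ctrl-Z) as an end-of-file marker, which is why the loop stops at the line that contains it.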
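As a minimal sketch of the binary-mode approach, the snippet below copies a file while dropping the SOH and SUB bytes. The function name strip_control_chars and the CONTROL_CHARS constant are made up for illustration. The output file is opened in binary mode as well, on the assumption that the line endings should be copied through unchanged rather than translated a second time.

```python
# Hypothetical helper, not part of any library: copies a file byte-for-byte
# except for the control characters listed in CONTROL_CHARS.
CONTROL_CHARS = b'\x01\x1a'  # SOH and SUB


def strip_control_chars(src_path, dst_path, delete=CONTROL_CHARS):
    """Copy src_path to dst_path without the bytes in `delete`.

    Returns the number of lines read from src_path.
    """
    count = 0
    # 'rb' and 'wb' skip text-mode processing, so 0x1A is read as an
    # ordinary byte and line endings are written back exactly as read.
    with open(src_path, 'rb') as rh, open(dst_path, 'wb') as wh:
        for line in rh:
            # translate(None, delete) removes every byte found in `delete`
            wh.write(line.translate(None, delete))
            count += 1
    return count
```

Calling strip_control_chars('writetest.txt', 'writetest_NoControlChars.txt') on the file from the question should then read all three lines instead of stopping after the second.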