In a script, I'm writing lines to a file, but some of the lines may be duplicates. So I've created a temporary cStringIO file-like object, which I call my "intermediate file". I write the lines to the intermediate file first, remove duplicates, then write to the real file.
So I wrote a simple for loop to iterate through every line in my intermediate file and remove any duplicates.
def remove_duplicates(f_temp, dir_out):  # f_temp is the cStringIO object.
    """Function to remove duplicates from the intermediate file and write to physical file."""
    lines_seen = set()  # Define a set to hold lines already seen.
    f_out = define_outputs(dir_out)  # Create the real output file by calling function "define_outputs". Note: This function is not shown in my pasted code.
    cStringIO.OutputType.getvalue(f_temp)  # From: https://stackoverflow.com/a/40553378/8117081
    for line in f_temp:  # Iterate through the cStringIO file-like object.
        line = compute_md5(line)  # Function to compute the MD5 hash of each line. Note: This function is not shown in my pasted code.
        if line not in lines_seen:  # Not a duplicate line (based on MD5 hash, which is supposed to save memory).
            f_out.write(line)
            lines_seen.add(line)
    f_out.close()
My problem is that the for loop never gets executed. I can verify this by setting a breakpoint in my debugger; that line of the code just gets skipped and the function exits. I even read this answer from this thread and inserted the code cStringIO.OutputType.getvalue(f_temp), but that didn't solve my issue.
I'm lost as to why I can't read and iterate through my file-like object.
The answer you referenced was a little incomplete. It shows how to get the cStringIO buffer as a string, but you then have to do something with that string; calling getvalue() and discarding the result changes nothing. You can do that like this:
def remove_duplicates(f_temp, dir_out):  # f_temp is the cStringIO object.
    """Function to remove duplicates from the intermediate file and write to physical file."""
    lines_seen = set()  # Define a set to hold hashes of lines already seen.
    f_out = define_outputs(dir_out)  # Create the real output file by calling function "define_outputs". Note: This function is not shown in my pasted code.
    # contents = cStringIO.OutputType.getvalue(f_temp)  # From: https://stackoverflow.com/a/40553378/8117081
    contents = f_temp.getvalue()  # simpler approach
    contents = contents.rstrip('\n')  # remove final newline to avoid adding an extra row
    lines = contents.split('\n')  # convert to iterable
    for line in lines:  # Iterate through the list of lines.
        digest = compute_md5(line)  # Function to compute the MD5 hash of each line. Note: This function is not shown in my pasted code.
        if digest not in lines_seen:  # Not a duplicate line (based on MD5 hash, which is supposed to save memory).
            f_out.write(line + '\n')  # write the original line, not its hash
            lines_seen.add(digest)
    f_out.close()
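As a side note on the string-based approach, here is a minimal sketch showing that getvalue() returns the whole buffer no matter where the read/write position is, and that splitlines(True) splits it while keeping each line's newline. It uses Python 3's io.StringIO as a stand-in for cStringIO (an assumption; the question's code targets Python 2), and the variable names are only illustrative.

```python
import io

buf = io.StringIO()  # stand-in for cStringIO.StringIO() (assumption)
buf.write("string 1\n")
buf.write("string 2\n")

# getvalue() ignores the current position and returns everything written.
contents = buf.getvalue()

# splitlines(True) keeps the trailing newline on each line, so nothing
# has to be stripped beforehand or re-added when writing.
lines = contents.splitlines(True)
print(lines)
```

With this variant the lines can be written out unchanged, since each one still carries its own newline.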
But it is probably better to use normal IO operations on the f_temp "file handle", like this:
def remove_duplicates(f_temp, dir_out):  # f_temp is the cStringIO object.
    """Function to remove duplicates from the intermediate file and write to physical file."""
    lines_seen = set()  # Define a set to hold hashes of lines already seen.
    f_out = define_outputs(dir_out)  # Create the real output file by calling function "define_outputs". Note: This function is not shown in my pasted code.
    # move f_temp's pointer back to the start of the file, to allow reading
    f_temp.seek(0)
    for line in f_temp:  # Iterate through the cStringIO file-like object.
        digest = compute_md5(line)  # Function to compute the MD5 hash of each line. Note: This function is not shown in my pasted code.
        if digest not in lines_seen:  # Not a duplicate line (based on MD5 hash, which is supposed to save memory).
            f_out.write(line)  # write the original line, not its hash
            lines_seen.add(digest)
    f_out.close()
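To see why the original loop body never ran, here is a small sketch of the file position after writing. It uses Python 3's io.StringIO as a stand-in for cStringIO (an assumption; the question's code targets Python 2), and the variable names are only illustrative.

```python
import io

buf = io.StringIO()  # stand-in for cStringIO.StringIO() (assumption)
buf.write("string 1\n")
buf.write("string 2\n")

# After writing, the position sits at the end of the buffer.
position_after_write = buf.tell()

# Iterating from the end finds nothing left to read.
lines_before_seek = list(buf)

# Rewinding to the start makes the written lines readable.
buf.seek(0)
lines_after_seek = list(buf)

print(lines_before_seek, lines_after_seek)
```

Iteration always starts at the current position, so a buffer that has only been written to looks empty until it is rewound.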
Here's a test (with either one):
import cStringIO

def define_outputs(dir_out):
    return open('/tmp/test.txt', 'w')

def compute_md5(line):
    return line

f = cStringIO.StringIO()
f.write('string 1\n')
f.write('string 2\n')
f.write('string 1\n')
f.write('string 2\n')
f.write('string 3\n')
remove_duplicates(f, 'tmp')
with open('/tmp/test.txt', 'r') as f:
    print(str([row for row in f]))
# ['string 1\n', 'string 2\n', 'string 3\n']
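The test stubs out compute_md5, which is not shown in the question. For reference, here is one possible sketch of it, assuming it is meant to return the hex MD5 digest of a line. The digest is used only for the duplicate check, and the original line is what gets passed on. It is written for Python 3 with io.StringIO standing in for cStringIO; the compute_md5 body and the unique_lines helper are hypothetical, not the asker's code.

```python
import hashlib
import io


def compute_md5(line):
    """Hypothetical stand-in for the helper not shown in the question."""
    return hashlib.md5(line.encode("utf-8")).hexdigest()


def unique_lines(f_temp):
    """Yield each distinct line once, in first-seen order (hypothetical helper)."""
    seen = set()      # holds digests, not the lines themselves
    f_temp.seek(0)    # rewind so the buffer can be read from the start
    for line in f_temp:
        digest = compute_md5(line)
        if digest not in seen:
            seen.add(digest)
            yield line  # pass on the original line, not its hash


buf = io.StringIO()
for text in ("string 1\n", "string 2\n", "string 1\n", "string 3\n"):
    buf.write(text)

result = list(unique_lines(buf))
print(result)
```

Whether hashing actually saves memory depends on line length: a hex digest is 32 characters, so the set only shrinks when the lines are longer than that.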