Tags: python, stringio, cstringio

Cannot Iterate over cStringIO


In a script, I'm writing lines to a file, but some of the lines may be duplicates. So I've created a temporary cStringIO file-like object, which I call my "intermediate file". I write the lines to the intermediate file first, remove duplicates, then write to the real file.

So I wrote a simple for loop to iterate through every line in my intermediate file and remove any duplicates.

def remove_duplicates(f_temp, dir_out):  # f_temp is the cStringIO object.
    """Function to remove duplicates from the intermediate file and write to physical file."""
    lines_seen = set()  # Define a set to hold lines already seen.
    f_out = define_outputs(dir_out)  # Create the real output file by calling function "define_outputs". Note: This function is not shown in my pasted code.

    cStringIO.OutputType.getvalue(f_temp)  # From: https://stackoverflow.com/a/40553378/8117081

    for line in f_temp:  # Iterate through the cStringIO file-like object.
        line = compute_md5(line)  # Function to compute the MD5 hash of each line. Note: This function is not shown in my pasted code.
        if line not in lines_seen:  # Not a duplicate line (based on MD5 hash, which is supposed to save memory).
            f_out.write(line)
            lines_seen.add(line)
    f_out.close()

My problem is that the body of the for loop never executes. I can verify this by setting a breakpoint in my debugger: the loop is skipped and the function exits. I even read an answer in another thread (linked in the code comment above) and inserted the call cStringIO.OutputType.getvalue(f_temp), but that didn't solve my issue.

I'm lost as to why I can't read and iterate through my file-like object.


Solution

  • The answer you referenced is incomplete. It shows how to get the cStringIO buffer as a string, but the call in your code discards the returned string; you still have to do something with it. For example:

    def remove_duplicates(f_temp, dir_out):  # f_temp is the cStringIO object.
        """Function to remove duplicates from the intermediate file and write to physical file."""
        lines_seen = set()  # Define a set to hold lines already seen.
        f_out = define_outputs(dir_out)  # Create the real output file by calling function "define_outputs". Note: This function is not shown in my pasted code.
    
        # contents = cStringIO.OutputType.getvalue(f_temp)  # From: https://stackoverflow.com/a/40553378/8117081
        contents = f_temp.getvalue()     # simpler approach
        contents = contents.rstrip('\n') # remove the trailing newline so split() doesn't yield an extra empty row
        lines = contents.split('\n')     # convert to a list of lines
    
        for line in lines:  # Iterate through the list of lines.
            line = compute_md5(line)  # Function to compute the MD5 hash of each line. Note: This function is not shown in my pasted code.
            if line not in lines_seen:  # Not a duplicate line (based on MD5 hash, which is supposed to save memory).
                f_out.write(line + '\n')
                lines_seen.add(line)
        f_out.close()
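
What getvalue() returns can be checked in isolation. This is only a sketch, and it assumes Python 3, where the cStringIO module no longer exists and io.StringIO is used instead. It shows that getvalue() returns the whole buffer regardless of the current position, and that str.splitlines() is one alternative to the manual strip-and-split step, since it does not produce a trailing empty item.

```python
import io

buf = io.StringIO()  # assumption: Python 3 stand-in for cStringIO.StringIO()
buf.write('string 1\n')
buf.write('string 2\n')

# getvalue() ignores the current position and returns everything written.
contents = buf.getvalue()

# splitlines() drops the line endings and yields no trailing empty item.
lines = contents.splitlines()

# A plain split('\n') leaves an empty string after the final newline.
naive = contents.split('\n')

print(lines)
print(naive)
```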
    

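    But it is probably better to use normal file operations on the f_temp "file handle". The original loop was skipped because, after the writes, f_temp's position is at the end of the buffer, so there is nothing left to read. Seeking back to the start fixes that: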

    def remove_duplicates(f_temp, dir_out):  # f_temp is the cStringIO object.
        """Function to remove duplicates from the intermediate file and write to physical file."""
        lines_seen = set()  # Define a set to hold lines already seen.
        f_out = define_outputs(dir_out)  # Create the real output file by calling function "define_outputs". Note: This function is not shown in my pasted code.
    
        # move f_temp's pointer back to the start of the file, to allow reading
        f_temp.seek(0)
    
        for line in f_temp:  # Iterate through the cStringIO file-like object.
            line = compute_md5(line)  # Function to compute the MD5 hash of each line. Note: This function is not shown in my pasted code.
            if line not in lines_seen:  # Not a duplicate line (based on MD5 hash, which is supposed to save memory).
                f_out.write(line)
                lines_seen.add(line)
        f_out.close()
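
The effect of the stream position can be seen without any of the surrounding code. The sketch below is an illustration rather than part of the original script. It assumes Python 3 and io.StringIO in place of cStringIO, and the variable names are made up for the example.

```python
import io

buf = io.StringIO()  # assumption: Python 3 stand-in for cStringIO.StringIO()
buf.write('string 1\n')
buf.write('string 2\n')

# After writing, the position is at the end of the buffer,
# so iterating from here finds no lines.
lines_without_seek = list(buf)

# Rewind to the start; iteration now yields every line.
buf.seek(0)
lines_after_seek = list(buf)

print(lines_without_seek)
print(lines_after_seek)
```

The first list is empty, which matches a for loop whose body never runs.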
    

    Here's a test (with either one):

    import cStringIO
    
    def define_outputs(dir_out):
        return open('/tmp/test.txt', 'w') 
    
    def compute_md5(line):
        return line
    
    f = cStringIO.StringIO()
    f.write('string 1\n')
    f.write('string 2\n')
    f.write('string 1\n')
    f.write('string 2\n')
    f.write('string 3\n')
    
    remove_duplicates(f, 'tmp')
    with open('/tmp/test.txt', 'r') as f:
        print(str([row for row in f]))
    # ['string 1\n', 'string 2\n', 'string 3\n']
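
The test stubs compute_md5 so that it returns the line unchanged. If compute_md5 returned a real digest, the functions above would write that digest to the output instead of the text, because line is reassigned before f_out.write is called. One possible way to keep the original text while storing only digests in the set is sketched below. dedupe_lines is a hypothetical helper name, not something from the question, and the code assumes Python 3 with io.StringIO and hashlib.

```python
import hashlib
import io


def dedupe_lines(f_temp, f_out):
    """Copy unique lines from f_temp to f_out (hypothetical helper)."""
    seen_digests = set()
    f_temp.seek(0)  # rewind so iteration starts at the first line
    for line in f_temp:
        # Store a fixed-size digest rather than the full line.
        digest = hashlib.md5(line.encode('utf-8')).digest()
        if digest not in seen_digests:
            f_out.write(line)  # write the original text, not the hash
            seen_digests.add(digest)


src = io.StringIO()
for text in ['string 1\n', 'string 2\n', 'string 1\n', 'string 2\n', 'string 3\n']:
    src.write(text)

dst = io.StringIO()
dedupe_lines(src, dst)
print(dst.getvalue())
```

Keeping the digest in a separate variable means the set holds 16-byte values while the output still receives the readable lines.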