Tags: python, algorithm, performance, large-data, large-files

Time performance in generating a very large text file in Python


I need to generate a very large text file. Each line has a simple format:

Seq_num<SPACE>num_val
12343234 759

Let's assume I am going to generate a file with 100 million lines. I tried 2 approaches and, surprisingly, they give very different time performance.

  1. For loop over 100m. In each iteration I make a short string of seq_num<SPACE>num_val and then write it to the file. This approach takes a lot of time.

    ## APPROACH 1
    for seq_id in seq_ids:
        num_val = random.random()
        line = str(seq_id) + ' ' + str(num_val) + '\n'  # build one short line
        data_file.write(line)                           # write it immediately
    
  2. For loop over 100m. In each iteration I make a short string of seq_num<SPACE>num_val and append it to a list. When the loop finishes, I iterate over the list items and write each one to the file. This approach takes far less time.

    ## APPROACH 2
    data_lines = list()
    for seq_id in seq_ids:
        num_val = random.random()
        l = str(seq_id) + ' ' + str(num_val) + '\n'  # build the line
        data_lines.append(l)                         # collect it in memory
    for line in data_lines:
        data_file.write(line)                        # write everything afterwards
    

Note that:

  • Approach 2 has 2 loops instead of 1 loop.
  • I write to the file in a loop in both approach 1 and approach 2, so this step must be the same for both.

So approach 1 should take less time. Any hints on what I am missing?


Solution

  • Considering APPROACH 2, I think I can assume you have the data for all the lines (or at least in big chunks) before you need to write it to the file.

    The other answers are great and it was really informative to read them, but both focused on optimizing the file writing or on replacing the first for loop with a list comprehension (which is known to be faster).

    They missed the fact that you are iterating in a for loop to write the file, which is not really necessary.

    Instead of doing that, at the cost of more memory (affordable in this case, since a 100 million line file would be about 600 MB), you can create just one string in a more efficient way using the formatting or join features of Python's str, and then write that big string to the file at once, relying on a list comprehension (or generator expression) to produce the data to be formatted.

    With loop1 and loop2 from @Tombart's answer, I get elapsed time 0:00:01.028567 and elapsed time 0:00:01.017042, respectively.

    While with this code:

    from datetime import datetime
    import random

    start = datetime.now()

    data_file = open('file.txt', 'w')
    # lazily generate the 1 million lines and join them into a single string
    data_lines = ('%i %f\n' % (seq_id, random.random())
                  for seq_id in xrange(0, 1000000))  # use range() on Python 3
    contents = ''.join(data_lines)
    data_file.write(contents)

    end = datetime.now()
    print("elapsed time %s" % (end - start))
    

    I get elapsed time 0:00:00.722788, which is about 25% faster.

    Notice that data_lines is a generator expression, so the list of lines is not actually stored in memory: the lines are generated and consumed on demand by the join method. This means the only variable that occupies a significant amount of memory is contents. This also slightly reduces the running time.

    If the text is too large to do all the work in memory, you can always split it into chunks: format the string and write it to the file every million lines or so, as sketched below.
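
    As an illustration, here is a minimal sketch of that chunked variant (the file name and chunk size are arbitrary choices, not taken from the measured code above):

    import random

    CHUNK_SIZE = 1000000  # lines per chunk; tune to the memory you can afford

    with open('file.txt', 'w') as data_file:
        chunk = []
        for seq_id in range(100000000):  # xrange() on Python 2
            chunk.append('%i %f\n' % (seq_id, random.random()))
            if len(chunk) == CHUNK_SIZE:
                data_file.write(''.join(chunk))  # one big write per chunk
                chunk = []
        if chunk:  # flush the last partial chunk
            data_file.write(''.join(chunk))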

    Conclusions:

    • Always try to use a list comprehension instead of a plain for loop (a list comprehension is even faster than filter for filtering lists, see here).
    • If memory and implementation constraints allow, try to create and write out the string contents all at once, using the format or join functions.
    • If possible, and as long as the code remains readable, use built-in functions to avoid for loops. For example, use a list's extend function instead of iterating and calling append. In fact, both previous points can be seen as examples of this remark (see the short sketch after this list).
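
    A minimal sketch illustrating these points (the variable names and values are arbitrary):

    import random

    # list comprehension instead of filter() for filtering a list
    evens = [n for n in range(1000) if n % 2 == 0]

    # extend() with a generator instead of iterating and calling append()
    lines = []
    lines.extend('%i %f\n' % (n, random.random()) for n in evens)

    # join()/format instead of concatenating strings inside a for loop
    contents = ''.join(lines)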

    Remark. Although this answer can be considered useful on its own, it does not completely address the question, namely why the two-loop option in the question seems to run faster in some environments. For that, @Aiken Drum's answer below may shed some light on the matter.