I need to generate a very large text file. Each line has a simple format:
Seq_num<SPACE>num_val
12343234 759
Let's assume I am going to generate a file with 100 million lines. I tried two approaches and, surprisingly, they give very different time performance.
## APPROACH 1

A for loop over the 100m sequence IDs. In each iteration I build the short string seq_num<SPACE>num_val and write it to the file. This approach takes a lot of time.

for seq_id in seq_ids:
    num_val = rand()
    line = str(seq_id) + ' ' + str(num_val) + '\n'
    data_file.write(line)
## APPROACH 2

A for loop over the 100m sequence IDs. In each iteration I build the short string seq_num<SPACE>num_val and append it to a list. When the loop finishes, I iterate over the list items and write each item to the file. This approach takes far less time.

data_lines = list()
for seq_id in seq_ids:
    num_val = rand()
    l = str(seq_id) + ' ' + str(num_val) + '\n'
    data_lines.append(l)
for line in data_lines:
    data_file.write(line)
Note that approach 2 has two loops instead of one, so approach 1 should take less time. Any hints what I am missing?
Considering APPROACH 2, I think I can assume you have the data for all the lines (or at least in big chunks) before you need to write it to the file.
The other answers are great and it was really informative to read them, but both focused on optimizing the file writing or on replacing the first for loop with a list comprehension (which is known to be faster).

They missed the fact that you are iterating in a for loop to write the file, which is not really necessary.

Instead of doing that, at the cost of more memory (affordable in this case, since a file of 100 million lines would be about 600 MB), you can build just one string in a more efficient way using the formatting or join features of Python's str, and then write that big string to the file in a single call, relying on a generator expression to produce the formatted lines.
With loop1 and loop2 of @Tombart's answer, I get elapsed time 0:00:01.028567 and elapsed time 0:00:01.017042, respectively.
While with this code:
import random
from datetime import datetime

start = datetime.now()
data_file = open('file.txt', 'w')
# generator expression: one formatted line per sequence id
data_lines = ('%i %f\n' % (seq_id, random.random())
              for seq_id in xrange(0, 1000000))
contents = ''.join(data_lines)   # build one big string
data_file.write(contents)        # single write call
data_file.close()
end = datetime.now()
print("elapsed time %s" % (end - start))
I get elapsed time 0:00:00.722788, which is about 25% faster.
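As an aside, the snippet above is Python 2 (it uses xrange and %-formatting); a roughly equivalent Python 3 sketch, not timed in the original answer, would be:

```python
# Hypothetical Python 3 equivalent of the snippet above (not from the
# original answer): range replaces xrange and an f-string does the formatting.
import random
from datetime import datetime

start = datetime.now()
with open('file.txt', 'w') as data_file:
    data_lines = (f'{seq_id} {random.random()}\n' for seq_id in range(1000000))
    data_file.write(''.join(data_lines))
end = datetime.now()
print(f"elapsed time {end - start}")
```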
Notice that `data_lines` is a generator expression, so the list of lines is never actually stored in memory: the lines are generated and consumed on demand by the `join` method. This means the only variable that occupies a significant amount of memory is `contents`. This also slightly reduces the running time.
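A small illustration (my own, not from the original answer) of the difference: a list comprehension materializes every line up front, while a generator expression stays tiny because items are only produced when `join` pulls them:

```python
import sys

# list comprehension: all 1,000 formatted strings exist in memory at once
lines_list = ['%i %f\n' % (i, 0.5) for i in range(1000)]
# generator expression: lines are produced lazily, one at a time
lines_gen = ('%i %f\n' % (i, 0.5) for i in range(1000))

print(sys.getsizeof(lines_list))  # grows with the number of items
print(sys.getsizeof(lines_gen))   # small and constant

contents = ''.join(lines_gen)     # join consumes the generator on demand
```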
If the text is too large to do all the work in memory, you can always split it into chunks: that is, format the string and write it to the file every million lines or so, as in the sketch below.
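A minimal sketch of that chunked variant (the constants TOTAL_LINES and CHUNK_SIZE are illustrative, not from the original answer):

```python
import random

TOTAL_LINES = 100000000   # 100 million lines, as in the question
CHUNK_SIZE = 1000000      # format and write one million lines at a time

with open('file.txt', 'w') as data_file:
    for start in range(0, TOTAL_LINES, CHUNK_SIZE):
        chunk = ''.join('%i %f\n' % (seq_id, random.random())
                        for seq_id in range(start, start + CHUNK_SIZE))
        data_file.write(chunk)
```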
Conclusions:

- Try to use a list comprehension instead of a plain for loop whenever possible (you can even use `filter` for filtering lists, see here).
- Build big strings with the `format` or `join` functions of `str`.
- Try to avoid `for` loops. For example, use the `extend` method of a list instead of iterating and calling `append` (see the short sketch after the remark below). In fact, both previous points can be seen as examples of this remark.

Remark: Although this answer can be considered useful on its own, it does not completely address the question, which is why the two-loops option in the question seems to run faster in some environments. For that, perhaps @Aiken Drum's answer below can shed some light on the matter.
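A small illustration of the extend-versus-append point (my own example, not from the original answer):

```python
# Hypothetical example: copying items from one list into another.
src = list(range(1000))

# explicit for loop, calling append once per item
dst_a = []
for item in src:
    dst_a.append(item)

# extend does the same in a single call, keeping the loop in C
dst_b = []
dst_b.extend(src)

assert dst_a == dst_b
```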