
How to overcome memory issue when sequentially appending files to one another


I am running the following script to append files to one another, cycling through months and years where the file exists. I have just tested it with a larger dataset, where I would expect the output file to be roughly 600 MB in size, and I am running into memory issues. First, is it normal to run into memory issues like this (my PC has 8 GB of RAM)? I am not sure how I am using up all of that memory.

Code I am running

import datetime,  os
import StringIO

stored_data = StringIO.StringIO()

start_year = "2011"
start_month = "November"
first_run = False

current_month = datetime.date.today().replace(day=1)
possible_month = datetime.datetime.strptime('%s %s' % (start_month, start_year), '%B %Y').date()
while possible_month <= current_month:
    csv_filename = possible_month.strftime('%B %Y') + ' MRG.csv'
    if os.path.exists(csv_filename):
        with open(csv_filename, 'rb') as current_csv:
            if first_run != False:
                next(current_csv)
            else:
                first_run = True
            stored_data.writelines(current_csv)
    possible_month = (possible_month + datetime.timedelta(days=31)).replace(day=1)
if stored_data:
    contents = stored_data.getvalue()
    with open('FullMergedData.csv', 'wb') as output_csv:
        output_csv.write(contents)

The traceback I receive:

Traceback (most recent call last):
  File "C:\code snippets\FullMerger.py", line 23, in <module>
    contents = stored_output.getvalue()
  File "C:\Python27\lib\StringIO.py", line 271, in getvalue
    self.buf += ''.join(self.buflist)
MemoryError

Any ideas for a workaround, or how to make this code more efficient, to overcome this issue? Many thanks,
AEA

Edit1

Upon running the code supplied by alKid, I received the following traceback.

Traceback (most recent call last):
  File "C:\FullMerger.py", line 22, in <module>
    output_csv.writeline(line)
AttributeError: 'file' object has no attribute 'writeline'

I fixed the above by changing it to writelines; however, I still received the following traceback.

Traceback (most recent call last):
  File "C:\FullMerger.py", line 19, in <module>
    next(current_csv)
StopIteration
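For reference, the StopIteration above comes from calling next() on a file iterator that has already been exhausted; passing a default value as the second argument makes next() return that value instead of raising. A minimal sketch:

```python
# StopIteration is raised when next() is called on an exhausted iterator;
# passing a default makes next() return that value instead of raising.
lines = iter(["header\n", "row1\n"])
print(next(lines, None))  # -> "header\n"
print(next(lines, None))  # -> "row1\n"
print(next(lines, None))  # -> None: the iterator is exhausted, no exception
```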

Solution

  • In stored_data, you're trying to hold the entire merged file in memory, and since it's too large, you're getting the MemoryError you're showing.

    One solution is to write the output line by line. That is far more memory-efficient, since you only hold one line of data in the buffer at a time, instead of the whole 600 MB.

    In short, the structure can be something like this:

    with open('FullMergedData.csv', 'a') as output_csv:  # 'a' appends to the output file
        with open(csv_filename, 'rb') as current_csv:
            if first_run:
                next(current_csv, None)  # skip the duplicate header on later files
            else:
                first_run = True  # keep the header from the first file only
            for line in current_csv:     # loop through the lines
                output_csv.write(line)   # write one line at a time

    This should fix your problem. Hope this helps!
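Putting the pieces together, here is a self-contained sketch of the whole streaming merge. The sample file names and rows are made up for the demo (the question's version iterates months with strftime instead of globbing), but the header-skip and line-by-line writing are the same:

```python
import glob
import os
import tempfile

# Demo setup: create two small CSV files with identical headers.
# These names and rows are illustrative only.
workdir = tempfile.mkdtemp()
for name, rows in [('a.csv', ['x,y\n', '1,2\n']), ('b.csv', ['x,y\n', '3,4\n'])]:
    with open(os.path.join(workdir, name), 'w') as f:
        f.writelines(rows)

# Collect the input files before creating the output, so the merged
# file is not picked up as one of its own inputs.
inputs = sorted(glob.glob(os.path.join(workdir, '*.csv')))

first_run = False
with open(os.path.join(workdir, 'merged.csv'), 'w') as output_csv:
    for csv_filename in inputs:
        with open(csv_filename) as current_csv:
            if first_run:
                next(current_csv, None)  # skip the duplicate header
            else:
                first_run = True         # keep the first file's header
            for line in current_csv:     # only one line in memory at a time
                output_csv.write(line)

with open(os.path.join(workdir, 'merged.csv')) as f:
    print(f.read())
```

Because only one line is buffered at a time, the peak memory use is independent of the total output size, unlike the StringIO approach in the question.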