Tags: python, csv, dictionary, generator, avro

Creating a list of dictionaries from a big CSV


I have a very big CSV file (10 GB) and I'd like to read it and create a list of dictionaries where each dictionary represents a line in the CSV. Something like

[{'value1': '20150302', 'value2': '20150225','value3': '5', 'IS_SHOP': '1', 'value4': '0', 'value5': 'GA321D01H-K12'},
{'value1': '20150302', 'value2': '20150225', 'value3': '1', 'value4': '0', 'value5': '1', 'value6': 'GA321D01H-K12'}]

I'm trying to achieve this using a generator in order to avoid memory issues. My current code is the following:

import csv

def csv_reader():
    with open('export.csv') as f:
        reader = csv.DictReader(f)
        for row in reader:
            yield {key: value for key, value in row.items()}

generator = csv_reader() 
list = []
for i in generator:
    list.append(i)

The problem is that it basically runs out of memory because the list becomes too big, and the process is killed. Is there a way to achieve the same result (a list of dictionaries) in an efficient way? I'm very new to generators/yield, so I don't even know if I'm using them correctly.

I also tried using a virtual environment with PyPy, but the memory breaks anyway (a little later, though).

Basically, the reason I want a list of dictionaries is that I want to try to convert the CSV into the Avro format using fastavro, so any hints on how to use fastavro (https://pypi.python.org/pypi/fastavro) without creating a list of dictionaries would be appreciated.


Solution

  • If the goal is to convert from CSV to Avro, there is no reason to store a complete list of the input values; that defeats the whole purpose of using a generator. After setting up a schema, fastavro's writer is designed to take an iterable and write it out one record at a time, so you can just pass it the generator directly. Your code would simply omit the step of creating the list (side note: naming a variable list is a bad idea, since it shadows/stomps on the builtin name list) and write from the generator directly:

    import csv
    from fastavro import writer

    def csv_reader():
        with open('export.csv') as f:
            reader = csv.DictReader(f)
            for row in reader:
                yield row

    # On Python 3.3+, the generator body can be simplified to just:
    # def csv_reader():
    #     with open('export.csv') as f:
    #         yield from csv.DictReader(f)

    # schema could be built from the keys of the first row, which you read manually,
    # or you can provide an explicit schema with documentation for each field
    schema = {...}

    with open('export.avro', 'wb') as out:
        writer(out, schema, csv_reader())
    

    The generator then produces one row at a time, and writer writes one row at a time. The input rows are discarded after writing, so memory usage remains minimal.
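
    Since the schema is elided above, here is a minimal sketch of what an explicit fastavro schema for the sample rows might look like, treating every column as a string. The record name and the field names/types are assumptions based on the example rows in the question, not something fastavro dictates; the helper schema_from_header below is likewise hypothetical, not part of fastavro:

    # a hypothetical explicit schema for the sample rows; adjust the names
    # and types to match the real CSV columns
    schema = {
        'name': 'export_record',
        'type': 'record',
        'fields': [
            {'name': 'value1', 'type': 'string'},
            {'name': 'value2', 'type': 'string'},
            {'name': 'value3', 'type': 'string'},
            {'name': 'value4', 'type': 'string'},
            {'name': 'value5', 'type': 'string'},
        ],
    }

    # alternatively, derive the field names from the CSV header row,
    # still treating every column as a string (hypothetical helper)
    def schema_from_header(path):
        with open(path) as f:
            header = next(csv.reader(f))
        return {
            'name': 'export_record',
            'type': 'record',
            'fields': [{'name': name, 'type': 'string'} for name in header],
        }

    Either way, the schema is built once up front; only the rows themselves stream through the generator.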

    If you need to modify the rows, you'd modify each row in the csv_reader generator before yielding it.
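
    For instance, here is a minimal sketch of a modified generator that casts one column before yielding. The column name and target type are assumptions for illustration; the matching field in the schema would need to be declared as 'int' rather than 'string':

    def csv_reader():
        with open('export.csv') as f:
            for row in csv.DictReader(f):
                # DictReader yields every value as a string, so cast
                # value3 here (hypothetical column) before it reaches writer
                row['value3'] = int(row['value3'])
                yield row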