python, arrays, csv, numpy, genfromtxt

Efficient way to process CSV file into a numpy array


The CSV file may not be clean (some lines have an inconsistent number of elements); unclean lines need to be disregarded. String manipulation is required during processing.

Example input:

20150701 20:00:15.173,0.5019,0.91665

Desired output: float32 (pseudo-date, seconds in the day, f3, f4)

0.150701 72015.173 0.5019 0.91665 (+ the trailing trash floats usually get)

The CSV file is also very big: the numpy array in memory is expected to take 5-10 GB, and the CSV file itself is over 30 GB.

Looking for an efficient way to process the CSV file and end up with a numpy array.

Current solution: use the csv module, process line by line, and use a list as a buffer that is later turned into a numpy array with asarray(). The problem is that the conversion roughly doubles memory consumption, and the copy adds execution overhead.
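
For reference, a minimal sketch of that buffering approach (parse_line stands in for the hypothetical string-manipulation step):

    import csv
    import numpy as np

    rows = []                                # list used as a temporary buffer
    with open('myfile') as f:
        for fields in csv.reader(f):
            if len(fields) != 3:             # disregard unclean lines
                continue
            rows.append(parse_line(fields))  # hypothetical parsing helper

    a = np.asarray(rows, dtype=np.float32)   # copies the data, so memory roughly doubles here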

NumPy's genfromtxt and loadtxt don't appear to be able to process the data as desired.


Solution

  • If you know in advance how many rows are in the data, you could dispense with the intermediate list and write directly to the array.

    import numpy as np
    
    no_rows = 5
    no_columns = 4
    
    # Preallocate the full array up front; float32 matches the desired output
    # type and halves the memory footprint compared to float64.
    a = np.zeros((no_rows, no_columns), dtype=np.float32)
    
    with open('myfile') as f:
        for i, line in enumerate(f):
            # parse each line and write it straight into its preallocated row
            a[i, :] = cool_function_that_returns_formatted_data(line)
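
  • The answer leaves cool_function_that_returns_formatted_data undefined. A minimal sketch of what it could look like, assuming the exact input and output format shown above (an illustration, not part of the original answer):

    def cool_function_that_returns_formatted_data(line):
        # '20150701 20:00:15.173,0.5019,0.91665'
        #     -> (0.150701, 72015.173, 0.5019, 0.91665)
        timestamp, f3, f4 = line.rstrip('\n').split(',')
        date_str, time_str = timestamp.split(' ')
        hours, minutes, seconds = time_str.split(':')
        pseudo_date = float('0.' + date_str[2:])    # '20150701' -> 0.150701
        seconds_in_day = int(hours) * 3600 + int(minutes) * 60 + float(seconds)
        return pseudo_date, seconds_in_day, float(f3), float(f4)

    Unclean lines would still need to be handled, e.g. by catching parse errors and only advancing the row index on success rather than using enumerate directly.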