Tags: python, numpy, hdf5, h5py, numpy-memmap

How to read a large text file avoiding reading line-by-line :: Python


I have a large data file of shape (N, 4) that I am mapping line by line. My files are about 10 GB each; a simplistic implementation is given below. Although it works, it takes a huge amount of time.

I would like to implement this logic such that the text file is read directly and I can access the elements. Thereafter, I need to sort the whole (mapped) file based on column-2 elements.

The examples I see online assume a smaller piece of data (d) and use f[:] = d[:], but I can't do that since d is huge in my case and eats my RAM.

PS: I know how to load the file using np.loadtxt and sort it using argsort, but that approach fails (memory error) for files of several GB. I would appreciate any direction.

import numpy as np

nrows, ncols = 20000000, 4  # the real nrows is larger; this is just for illustration
f = np.memmap('memmapped.dat', dtype=np.float32,
              mode='w+', shape=(nrows, ncols))

filename = "my_file.txt"

with open(filename) as file:
    # parse one comma-separated line at a time into row i of the memmap
    for i, line in enumerate(file):
        floats = [float(x) for x in line.split(',')]
        f[i, :] = floats
f.flush()  # write pending changes to disk
del f

Solution

  • EDIT: Instead of do-it-yourself chunking, it's better to use the chunking feature of pandas, which is much, much faster than numpy's loadtxt.

    import numpy as np
    import pandas as pd
    
    ## create csv file for testing
    np.random.seed(1)
    nrows, ncols = 100000, 4
    data = np.random.uniform(size=(nrows, ncols))
    np.savetxt('bigdata.csv', data, delimiter=',')
    
    ## read it back
    chunk_rows = 12345
    # Replace np.empty by np.memmap array for large datasets.
    odata = np.empty((nrows, ncols), dtype=np.float32)
    oindex = 0
    chunks = pd.read_csv('bigdata.csv', chunksize=chunk_rows, 
                         names=['a', 'b', 'c', 'd'])
    for chunk in chunks:
        m, _ = chunk.shape
        odata[oindex:oindex+m, :] = chunk
        oindex += m
    
    # check that it worked correctly.
    assert np.allclose(data, odata, atol=1e-7)
    

    The pd.read_csv function in chunked mode returns a special object that can be used in a loop such as for chunk in chunks:; at every iteration, it will read a chunk of the file and return its contents as a pandas DataFrame, which can be treated as a numpy array in this case. The parameter names is needed to prevent it from treating the first line of the csv file as column names.
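
The comment in the code above suggests replacing np.empty with an np.memmap array for large datasets. A minimal sketch of that variant follows, assuming the number of rows is known in advance (for example from a first pass that counts lines). The helper name csv_to_memmap and the file names are hypothetical, and header=None is used here as an alternative to names for a file without a header line.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def csv_to_memmap(csv_path, dat_path, nrows, ncols, chunk_rows=100_000):
    """Copy a header-less numeric CSV into a float32 memmap, chunk by chunk.

    Hypothetical helper; nrows and ncols must be known beforehand.
    """
    out = np.memmap(dat_path, dtype=np.float32, mode='w+', shape=(nrows, ncols))
    start = 0
    # header=None: the first line is data, not column names.
    for chunk in pd.read_csv(csv_path, header=None, chunksize=chunk_rows):
        stop = start + len(chunk)
        # Values are cast from float64 to float32 on assignment.
        out[start:stop, :] = chunk.to_numpy()
        start = stop
    out.flush()  # make sure everything is written to disk
    return out


# Small demonstration on temporary files.
tmpdir = tempfile.mkdtemp()
csv_path = os.path.join(tmpdir, 'demo.csv')
dat_path = os.path.join(tmpdir, 'demo.dat')
demo = np.random.default_rng(1).uniform(size=(1000, 4))
np.savetxt(csv_path, demo, delimiter=',')
mm = csv_to_memmap(csv_path, dat_path, nrows=1000, ncols=4, chunk_rows=123)
```

Only one chunk of the CSV is held in RAM at a time; the memmap pages are flushed to disk by the operating system as needed.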

    Old answer below

    The numpy.loadtxt function works with a filename or something that will return lines in a loop in a construct such as:

    for line in f:
        do_something()
    

    It doesn't even need to pretend to be a file; a list of strings will do!
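
As a quick illustration of that point, here is a small sketch that passes a plain list of strings to np.loadtxt. The sample values are made up. The ndmin=2 argument is shown because a batch containing a single line would otherwise come back as a 1-D array.

```python
import numpy as np

# A list of strings stands in for the lines of a file.
lines = ["1.0,2.0,3.0,4.0", "5.0,6.0,7.0,8.0"]
arr = np.loadtxt(lines, delimiter=',')
print(arr.shape)  # (2, 4)

# With only one line, ndmin=2 keeps the result two-dimensional.
one = np.loadtxt(lines[:1], delimiter=',', ndmin=2)
print(one.shape)  # (1, 4)
```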

    We can read chunks of the file that are small enough to fit in memory and provide batches of lines to np.loadtxt.

    def get_file_lines(fname, seek, maxlen):
        """Read lines from a section of a file.
        
        Parameters:
            
        - fname: filename
        - seek: start position in the file
        - maxlen: maximum length (bytes) to read
        
        Return:
            
        - lines: list of lines (only entire lines).
        - seek_end: seek position at end of this chunk.
        
        Reference: https://stackoverflow.com/a/63043614/6228891
        Copying: any of CC-BY-SA, CC-BY, GPL, BSD, LGPL
        Author: Han-Kwang Nienhuys
        """
        # binary mode for Windows \r\n line endings; 'with' closes the file
        with open(fname, 'rb') as f:
            f.seek(seek)
            buf = f.read(maxlen)
        n = len(buf)
        if n == 0:
            return [], seek
        
        # find a newline near the end; i counts back from the last byte (i=1)
        for i in range(1, min(10000, n) + 1):
            if buf[-i] == 0x0a:
                # newline
                buflen = n - i + 1
                lines = buf[:buflen].decode('utf-8').split('\n')
                seek_end = seek + buflen
                return lines, seek_end
        else:
            raise ValueError('Could not find end of line')
    
    import numpy as np
    
    ## create csv file for testing
    np.random.seed(1)
    nrows, ncols = 10000, 4
    
    data = np.random.uniform(size=(nrows, ncols))
    np.savetxt('bigdata.csv', data, delimiter=',')
    
    # read it back        
    fpos = 0
    chunksize = 456 # Small value for testing; make this big (megabytes).
    
    # we will store the data here. Replace by memmap array if necessary.
    odata = np.empty((nrows, ncols), dtype=np.float32)
    oindex = 0
    
    while True:
        lines, fpos = get_file_lines('bigdata.csv', fpos, chunksize)
        if not lines:
            # end of file
            break
        # ndmin=2 keeps rdata two-dimensional even if the chunk holds one line
        rdata = np.loadtxt(lines, delimiter=',', ndmin=2)
        m, _ = rdata.shape
        odata[oindex:oindex+m, :] = rdata
        oindex += m
        
    assert np.allclose(data, odata, atol=1e-7)
    

    Disclaimer: I tested this in Linux. I expect this to work in Windows, but it could be that the handling of '\r' characters causes problems.
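
The question also asks for sorting the mapped data by its second column, which the code above does not cover. The following is only a rough sketch of one possible approach, under the assumption that a single column plus an index array fits in RAM even though the full array does not. The function name sort_memmap_by_column and the file names are hypothetical, and "column 2" is interpreted as index 1.

```python
import os
import tempfile

import numpy as np


def sort_memmap_by_column(src, dst_path, col=1, chunk_rows=1_000_000):
    """Write the rows of `src`, ordered by column `col`, to a new memmap."""
    # Only one column and the resulting index array are held in RAM.
    key = np.ascontiguousarray(src[:, col])
    order = np.argsort(key, kind='stable')
    dst = np.memmap(dst_path, dtype=src.dtype, mode='w+', shape=src.shape)
    for start in range(0, len(order), chunk_rows):
        stop = start + chunk_rows
        # Fancy indexing gathers the selected rows from disk into RAM.
        dst[start:stop] = src[order[start:stop]]
    dst.flush()
    return dst


# Small demonstration on temporary files.
tmpdir = tempfile.mkdtemp()
src = np.memmap(os.path.join(tmpdir, 'unsorted.dat'), dtype=np.float32,
                mode='w+', shape=(1000, 4))
src[:] = np.random.default_rng(0).uniform(size=(1000, 4))
sorted_mm = sort_memmap_by_column(src, os.path.join(tmpdir, 'sorted.dat'),
                                  col=1, chunk_rows=128)
```

The gather step reads rows in random order, so it can be slow on spinning disks; it keeps memory use bounded by chunk_rows rather than by the file size.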