Tags: python, csv, memory, jupyter, large-files

Large csv files: MemoryError: Unable to allocate 3.25 GiB for an array with shape (7, 62388743) and data type object


I am working on analyzing some very large files (~200 million rows):

csv_filename = pd.read_csv('filename.txt', sep="\t", error_bad_lines=False)

The program runs for about half an hour before I get this error message:

MemoryError: Unable to allocate 3.25 GiB for an array with shape (7, 62388743) and data type object

I'm wondering if there is a way to get around this memory error, or if there is a different function I can use that won't require as much memory. I have tried splitting the file into pieces, but the problem with that is that I need all of the data in one DataFrame so that I can analyze it as a whole.


Solution

  • You can limit the number of columns that are read with usecols, which reduces the memory footprint. You also seem to have some bad data in the CSV file that causes columns you expect to be int64 to come in as object; these could be empty cells or any non-numeric values. Below is an example that reads the CSV and then scans for the bad data. It uses commas rather than tabs because that's a bit easier to demonstrate; a sketch applying the same idea back to the tab-separated file follows the example.

    import io
    import re
    import sys
    
    import numpy as np
    import pandas as pd
    
    # Small in-memory CSV with deliberately bad data in field2:
    # a non-numeric value ("what?") and an empty cell.
    test_csv = io.StringIO("""field1,field2,field3,other
    1,2,3,this
    4,what?,6,is
    7,,9,extra""")
    
    # A "good" value consists only of digits.
    _numbers_re = re.compile(r"\d+$")
    
    # usecols limits the read to the columns you actually need,
    # which reduces the memory footprint.
    df = pd.read_csv(test_csv, sep=",", error_bad_lines=False,
                     usecols=['field1', 'field2', 'field3'])
    print(df)
    
    # Columns that aren't int64; these contain bad data somewhere.
    bad_cols = list(df.dtypes[df.dtypes != np.dtype('int64')].index)
    if bad_cols:
        print("bad cols", bad_cols)
        for bad_col in bad_cols:
            col = df[bad_col]
            # != True also catches NaN from empty cells, since
            # str.match returns NaN for missing values.
            bad = col[col.str.match(_numbers_re) != True]
            print(bad)
        sys.exit(1)
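
  • As a rough sketch, the same usecols idea applied back to the original tab-separated file might look something like this. The column names 'col_a', 'col_b', 'col_c' are placeholders for whichever columns you actually need; reading only those columns is what cuts the memory footprint, and once the bad values are cleaned up, passing an explicit dtype keeps the columns out of memory-hungry object dtype.

    import pandas as pd
    
    # Placeholder column names; substitute the ones you actually analyze.
    wanted_cols = ['col_a', 'col_b', 'col_c']
    
    df = pd.read_csv(
        'filename.txt',
        sep="\t",
        error_bad_lines=False,   # removed in newer pandas; use on_bad_lines="skip" there
        usecols=wanted_cols,
        # Once the bad values are fixed, an explicit dtype avoids object columns:
        # dtype={'col_a': 'int64', 'col_b': 'int64', 'col_c': 'int64'},
    )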