I am working on analyzing some very large files (~200 million rows)
csv_filename=pd.read_csv('filename.txt',sep="\t",error_bad_lines=False)
The program runs for about half an hour before I get this error message:
MemoryError: Unable to allocate 3.25 GiB for an array with shape (7, 62388743) and data type object
I'm wondering if there is a way to get around this memory error, or if there is a different function I can use that won't require as much memory. I have split the file into pieces, but the problem with that is that I need all of the data in one dataframe so I can analyze it as a whole.
You can limit the columns that are loaded with usecols, which reduces the memory footprint. You also seem to have some bad data in the CSV file that is making columns you think should be int64 come out as object. These could be empty cells or any non-digit values. Here is an example that reads the CSV and then scans for the bad data. It uses commas rather than tabs because that's a bit easier to demonstrate.
import pandas as pd
import numpy as np
import io
import re

# Sample data with deliberately bad values in field2: a non-numeric
# string and an empty cell.
test_csv = io.StringIO("""field1,field2,field3,other
1,2,3,this
4,what?,6,is
7,,9,extra""")

# Matches values made up only of digits.
_numbers_re = re.compile(r"\d+$")

# usecols drops the columns you don't need, which reduces memory.
# (error_bad_lines is deprecated in newer pandas; on_bad_lines='skip' is the replacement.)
df = pd.read_csv(test_csv, sep=",", error_bad_lines=False,
                 usecols=['field1', 'field2', 'field3'])
print(df)

# Columns that aren't int64; these are the ones holding bad data.
bad_cols = list(df.dtypes[df.dtypes != np.dtype('int64')].index)
if bad_cols:
    print("bad cols", bad_cols)
    for bad_col in bad_cols:
        col = df[bad_col]
        # Values that aren't purely numeric; empty cells (NaN) also fail
        # the comparison, so they are flagged too.
        bad = col[col.str.match(_numbers_re) != True]
        print(bad)
    exit(1)
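Once you know which values are breaking the dtype, the same ideas apply to the real file: read only the columns you need and coerce the bad values so the columns are stored as numbers rather than object. This is only a sketch against your original call, assuming 'filename.txt' is tab-separated and that field1/field2/field3 stand in for whichever columns you actually analyze:

import pandas as pd

# Placeholder column names; substitute the columns you actually need.
wanted = ['field1', 'field2', 'field3']

# usecols keeps memory down by never loading the other columns.
# (On pandas >= 1.3 you can pass on_bad_lines='skip' instead of error_bad_lines.)
df = pd.read_csv('filename.txt', sep='\t', error_bad_lines=False,
                 usecols=wanted)

for col in wanted:
    # Non-numeric values such as 'what?' or empty cells become NaN, so the
    # column is stored as a numeric dtype instead of object, which is much smaller.
    df[col] = pd.to_numeric(df[col], errors='coerce')

If the result is still too large, downcasting after the coercion (for example df[col] = df[col].astype('float32')) roughly halves the memory used by each numeric column.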