Tags: python, pandas, csv, chunking

Memory error reading a big CSV in pandas


My laptop has 8 GB of memory, and I ran into memory issues trying to read and process a big CSV file. I found a solution, which is to use chunksize to process the file chunk by chunk, but apparently when using chunksize the returned object becomes a TextFileReader, and the code I was using to process normal CSVs doesn't work on it anymore. This is the code I'm trying to use to count how many sentences are inside the CSV file:

import pandas as pd

# peek at the first cell of the first line to decide whether the file has a header row
wdata = pd.read_csv(fileinput, nrows=0).columns[0]
skip = int(wdata.count(' ') == 0)  # a single word with no spaces is assumed to be a header
wdata = pd.read_csv(fileinput, names=['sentences'], skiprows=skip, chunksize=1000)

data = wdata.count()  # raises the AttributeError below
print(data)

The error I'm getting is:

Traceback (most recent call last):
  File "table.py", line 24, in <module>
    data = wdata.count()
AttributeError: 'TextFileReader' object has no attribute 'count'
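
From what I can tell, passing chunksize makes read_csv return an iterator of DataFrames rather than a single DataFrame, which is why count() isn't available on it. A quick type check (using the same fileinput as above) confirms this:

reader = pd.read_csv(fileinput, names=['sentences'], skiprows=skip, chunksize=1000)
print(type(reader))         # TextFileReader (the exact module path varies by pandas version)
first_chunk = next(reader)  # each item it yields is an ordinary DataFrame
print(type(first_chunk))    # <class 'pandas.core.frame.DataFrame'>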

I also tried another way around it, by running this code:


TextFileReader = pd.read_csv(fileinput, chunksize=1000)  # the number of rows per chunk

dfList = []
for df in TextFileReader:
    dfList.append(df)

df = pd.concat(dfList, sort=False)
print(df)

and it gives this error:


   data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 881, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 908, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 950, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 937, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2132, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 4
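
For what it's worth, I can reproduce the same error with a tiny in-memory file where one line has extra commas, so I suspect the problem is inconsistent rows in the CSV itself rather than the chunking (the values here are made up for the example):

from io import StringIO
import pandas as pd

bad = StringIO("a,b\n1,2\nx,y,z,w\n")  # line 3 has 4 fields, but the header has 2
pd.read_csv(bad)                       # ParserError: Expected 2 fields in line 3, saw 4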


Solution

  • You have to iterate over the chunks:

    csv_length = 0
    for chunk in pd.read_csv(fileinput, names=['sentences'], skiprows=skip, chunksize=10000):
        csv_length += chunk['sentences'].count()  # counting the column keeps csv_length an int
    print(csv_length)
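
  • The second traceback (the ParserError) is a different problem: line 3 of the file has more comma-separated fields than pandas inferred from the earlier lines. If those malformed rows can simply be dropped, here is a sketch of one way to get past it (assumes pandas ≥ 1.3 for on_bad_lines; older versions spell it error_bad_lines=False):

    csv_length = 0
    for chunk in pd.read_csv(fileinput, names=['sentences'], skiprows=skip,
                             chunksize=10000, on_bad_lines='skip'):
        csv_length += chunk['sentences'].count()
    print(csv_length)

    Note that with a single 'sentences' column this also drops any sentence that itself contains a comma, so a different sep may be worth considering if that matters.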