python-2.7, pandas, blaze, large-data

Can't load large file (~2 GB) using Pandas or Blaze in Python


I have a file with more than 5 million rows and 20 fields. I tried to open it in Pandas, but got an out-of-memory error:

pandas.parser.CParserError: Error tokenizing data. C error: out of memory

I then read some posts on similar issues and discovered Blaze, but none of the three methods I tried (bz.Data, bz.CSV, bz.Table) apparently worked.

# coding=utf-8
import pandas as pd
from pandas import DataFrame, Series
import re
import numpy as np
import sys
import blaze as bz
# Python 2 hack: reset the default string encoding to utf-8
reload(sys)
sys.setdefaultencoding('utf-8')

# Gave an out of memory error
'''data = pd.read_csv('file.csv', header=0, encoding='utf-8', low_memory=False)
df = DataFrame(data)

print df.shape
print df.head'''

data = bz.Data('file.csv')

# Tried the following too, but no luck
'''data = bz.CSV('file.csv')
data = bz.Table('file.csv')'''

print data
print data.head(5)

Output:

_1
_1.head(5)
[Finished in 1.0s]

Solution

  • Blaze

For the bz.Data(...) object you have to actually do something to get a result; Blaze loads the data lazily, as needed. If you were at a terminal and typed

    >>> data
    

you would get the head repr-ed to the screen. If you need to print it explicitly, then try

    bz.compute(data.head(5))
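
    If you need a plain pandas DataFrame back, the odo machinery bundled with Blaze can materialize an expression into one. A minimal sketch, assuming a Blaze version that re-exports odo (otherwise use from odo import odo):

    import blaze as bz
    import pandas as pd

    data = bz.Data('file.csv')  # builds a lazy expression; nothing is loaded yet

    # compute() evaluates only the piece you ask for
    print bz.compute(data.head(5))

    # odo converts between containers; here, Blaze expression -> pandas DataFrame
    df = bz.odo(data.head(5), pd.DataFrame)
    print df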
    

  • dask.dataframe

    You might also consider dask.dataframe, which exposes a similar (though subsetted) API to pandas:

    >>> import dask.dataframe as dd
    >>> data = dd.read_csv('file.csv', header=0, encoding='utf-8')
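
    Here read_csv is lazy as well: dask only partitions the file, and rows are pulled in chunk by chunk once a result is actually needed. A minimal sketch of the eager/lazy split, under the same assumptions as above (Python 2, a local file.csv):

    import dask.dataframe as dd

    # lazily partition the CSV; nothing is loaded into memory yet
    data = dd.read_csv('file.csv', header=0, encoding='utf-8')

    # head() evaluates eagerly and returns an ordinary pandas DataFrame
    print data.head(5)

    # everything else stays lazy until you call .compute()
    print data.describe().compute()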