Tags: python, pandas, dataframe, cassandra

Read large data from Cassandra into a Python dataframe (memory error)


I am trying to read 2048-dimensional feature vectors (1 million records) from Cassandra into a pandas dataframe, and it crashes every time.

I have 32 GB of RAM, but I am still not able to read all the data into memory; my Python program crashes every time I try to load it. I need all the data in memory at once for my machine learning algorithm. (The data is 18 GB as CSV.)

import pandas as pd

from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from cassandra.query import dict_factory

auth_provider = PlainTextAuthProvider(username=CASSANDRA_USER, password=CASSANDRA_PASS)
cluster = Cluster(contact_points=[CASSANDRA_HOST], port=CASSANDRA_PORT,
    auth_provider=auth_provider)

session = cluster.connect(CASSANDRA_DB)
session.row_factory = dict_factory

query = "SELECT * FROM Table"

df = pd.DataFrame()

for row in session.execute(query):
    # dict_factory returns each row as a dict; append it as a one-row frame
    df = df.append(pd.DataFrame([row]), ignore_index=True)

Is this the right approach for reading the data into a pandas dataframe? Is there a more memory-efficient way to load all of it?

Options I am considering as a last resort: 1) reduce the feature vector dimension, 2) add more RAM.

I cannot dump the data to CSV or any other file format, because I still have other operations to run on the data in Cassandra.

The program crashes every time with the message Killed, which is caused by running out of memory.


Solution

  • I had a similar problem when reading data into a Pandas dataframe from SQL Server (over an ODBC connection). This seems to be a problem on Pandas' side: the dataframe took more than 10x the space in RAM compared to the space the data occupied in the original DB.

    Using an H2O dataframe is more efficient (in my case it took only 2x-3x the space in RAM); a rough sketch of that route is shown at the end of this answer.

    Also look at this post. If you can read the data in chunks, that could help as well; a minimal chunked-read sketch follows the H2O one below.
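
    A rough, hedged sketch of the H2O route (not code from the original answer): it assumes the h2o package is installed, reuses the session and query from the question, and the max_mem_size value is only an example. Note that the raw rows still pass through Python memory once before H2O takes ownership of them in its own JVM heap.

    import h2o

    h2o.init(max_mem_size="24G")       # heap size for the local H2O instance (example value)

    rows = session.execute(query)      # dict_factory yields one dict per row
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)

    # the frame's data now lives in H2O's own (JVM) memory, not in pandas
    hf = h2o.H2OFrame(columns)
    print(hf.dim)                      # [number of rows, number of columns]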
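
    And a minimal sketch of reading in chunks with the DataStax driver's paging, again reusing the session from the question; the fetch_size and the 5000-row chunk size are assumptions. Building the dataframe once at the end (or processing each chunk and discarding it) avoids growing a dataframe row by row:

    import pandas as pd
    from cassandra.query import SimpleStatement

    statement = SimpleStatement("SELECT * FROM Table", fetch_size=5000)

    chunks = []
    buffer = []
    for row in session.execute(statement):       # the driver fetches pages transparently
        buffer.append(row)
        if len(buffer) >= 5000:
            chunks.append(pd.DataFrame(buffer))  # or process and discard the chunk here
            buffer = []
    if buffer:
        chunks.append(pd.DataFrame(buffer))

    df = pd.concat(chunks, ignore_index=True)    # single concatenation at the end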