python cassandra datastax datastax-astra

Bulk loading a large dataframe into Astra DB


I'm trying to load my dataframe into Astra DB, but it's taking forever. Is there a faster way to do it from Python?

from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
import pandas as pd

cloud_config = {
    'secure_connect_bundle': 'secure-connect-capstone-project.zip'
}
# user and password hold the Astra credentials (defined elsewhere)
auth_provider = PlainTextAuthProvider(user, password)
cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider)
# connect to the keyspace
session = cluster.connect('iac689')

query = """insert into data_2 (truck_id, active, reading_id, start_mileage, start_time, truck_name, type)
values (%s,%s,%s,%s,%s,%s,%s)"""
# df is the pandas DataFrame to load; this runs one synchronous insert per row
for row in df.values:
    session.execute(query, list(row[:7]))

Solution

  • If you really need to do this via Python, you can speed up the code by:

    • Using prepared statements - call session.prepare on your query string once (with ? placeholders instead of %s), and pass the prepared statement to session.execute.
    • Using the asynchronous API (execute_async) instead of the synchronous one (execute). You then need to limit how many queries are in flight at once, otherwise you will start getting errors such as timeouts.
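
    The two points above can be combined roughly as follows. This is only a sketch, not a tuned loader: load_rows and max_in_flight are made-up names, the session is assumed to be the one returned by cluster.connect, and the CQL string is assumed to use ? placeholders. A semaphore caps the number of pending requests.

```python
import threading


def load_rows(session, insert_cql, rows, max_in_flight=64):
    """Insert rows with one prepared statement and a bounded number of async requests.

    load_rows and max_in_flight are illustrative names, not driver API.
    insert_cql must use ? placeholders, as prepared statements require.
    """
    prepared = session.prepare(insert_cql)      # prepared once, reused for every row
    slots = threading.Semaphore(max_in_flight)  # caps the number of in-flight requests
    errors = []

    def on_success(_result):
        slots.release()

    def on_error(exc):
        errors.append(exc)                      # collect failures instead of raising
        slots.release()

    count = 0
    for row in rows:
        slots.acquire()                         # blocks while max_in_flight requests are pending
        future = session.execute_async(prepared, tuple(row))
        future.add_callbacks(on_success, on_error)
        count += 1

    # Wait for the remaining requests by reclaiming every slot.
    for _ in range(max_in_flight):
        slots.acquire()

    return count, errors
```

    With the question's table, the call would look like load_rows(session, "insert into data_2 (truck_id, active, reading_id, start_mileage, start_time, truck_name, type) values (?,?,?,?,?,?,?)", df.itertuples(index=False, name=None)). Any exceptions end up in the returned errors list, so check it after the load.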

    Really, I would recommend not reinventing the wheel: dump the data to a CSV or JSON file and use DSBulk to load it into Cassandra/Astra. The tool is heavily optimized for loading data into and unloading data from Cassandra/Astra.
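
    A rough sketch of the DSBulk route, driven from Python. The helper names, the data_2.csv file name and the credential arguments are placeholders; the keyspace, table and bundle values are taken from the question. It assumes the dsbulk executable is on PATH and that the CSV was written with a header row, for example with df.to_csv("data_2.csv", index=False).

```python
import subprocess


def build_dsbulk_load_command(csv_path, keyspace, table, bundle_path,
                              client_id, client_secret):
    """Return the argument list for a `dsbulk load` run against Astra.

    Hypothetical helper: it only assembles the command line.
    """
    return [
        "dsbulk", "load",
        "-url", csv_path,        # CSV file to load
        "-k", keyspace,          # target keyspace
        "-t", table,             # target table
        "-b", bundle_path,       # Astra secure connect bundle
        "-u", client_id,         # credentials (placeholders)
        "-p", client_secret,
        "-header", "true",       # first CSV line holds the column names
    ]


def run_dsbulk(command):
    """Run the assembled command; raises CalledProcessError if dsbulk fails."""
    return subprocess.run(command, check=True)
```

    For the question's setup this would be run_dsbulk(build_dsbulk_load_command("data_2.csv", "iac689", "data_2", "secure-connect-capstone-project.zip", client_id, client_secret)), with the CSV column names matching the columns of data_2.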