Tags: csv, import, cassandra, parquet

Is there a simple way to load parquet files directly into Cassandra?


I have a parquet file/folder (about 1 GB) that I would like to load into my local Cassandra DB. Unfortunately I could not find any way (except via Spark, in Scala) to load this file directly into Cassandra. If I expand the parquet file into CSV, it gets way too huge for my laptop.

I am setting up a Cassandra DB for a big-data analytics case (about 25 TB of raw data that we need to make searchable fast). Right now I am running local tests on how to optimally design the keyspaces, indices, and tables before moving to Cassandra as a Service on a hyperscaler. Converting the data to CSV is not an option, as it blows up too much. For reference, this is the cqlsh COPY command I would otherwise use:

COPY firmographics.company (col1,col2,col3.....) FROM 'C:\Users\Public\Downloads\companies.csv' WITH DELIMITER='\t' AND HEADER=TRUE;
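
For local testing I create the keyspace and table roughly like this with the DataStax Python driver (a sketch only; the column names below are placeholders, not the real firmographics schema):

    from cassandra.cluster import Cluster

    # Connect to the local single-node Cassandra instance
    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect()

    # SimpleStrategy with replication_factor=1 is only suitable for local tests
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS firmographics
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)

    # Placeholder table layout; the real columns depend on the parquet schema
    session.execute("""
        CREATE TABLE IF NOT EXISTS firmographics.company (
            company_id text PRIMARY KEY,
            name text,
            industry text
        )
    """)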

Solution

  • Turns out, as Alex Ott said, it's easy enough to just write this up in Spark. Below is my code:

    import time

    import findspark
    findspark.init()  # must run before importing pyspark so the Spark libs are found

    from pyspark.sql import SparkSession

    # Pull in the Spark Cassandra Connector (Scala 2.11 build, version 2.3.2)
    spark = SparkSession\
        .builder\
        .appName("Spark Exploration App")\
        .config('spark.jars.packages', 'com.datastax.spark:spark-cassandra-connector_2.11:2.3.2')\
        .getOrCreate()

    # Read the parquet file/folder into a Spark DataFrame
    df = spark.read.parquet("/PATH/TO/FILE/")

    # Time the write; drop the 'filename' column since it is not part of the table
    start = time.time()

    df.drop('filename').write\
        .format("org.apache.spark.sql.cassandra")\
        .mode('append')\
        .options(table="few_com", keyspace="bmbr")\
        .save()

    end = time.time()
    print(end - start)
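
    If you want to sanity-check the load afterwards, the same connector can read the table back so the row counts can be compared (a quick sketch using the keyspace/table names from above):

        # Read the table back through the connector and compare row counts
        written = spark.read\
            .format("org.apache.spark.sql.cassandra")\
            .options(table="few_com", keyspace="bmbr")\
            .load()

        print(written.count(), df.count())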