I have a Parquet file/folder (about 1 GB) that I would like to load into my local Cassandra DB. Unfortunately I could not find any way (except via Spark in Scala) to load it into Cassandra directly. Blowing the Parquet file out into CSV would make it far too large for my laptop.
I am setting up Cassandra for a big-data analytics case (about 25 TB of raw data that we need to make searchable fast). Right now I am running local tests on how best to design the keyspaces, indices and tables before moving to Cassandra-as-a-Service on a hyperscaler. Converting the data to CSV is not an option, as it blows up too much in size. If CSV were feasible, I would simply load it with a CQL COPY statement like this:
COPY firmographics.company (col1,col2,col3.....) FROM 'C:\Users\Public\Downloads\companies.csv' WITH DELIMITER='\t' AND HEADER=TRUE;
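Whichever load path is used, the target keyspace and table have to exist first. A minimal sketch of that setup with the Python cassandra-driver, assuming a single local node (SimpleStrategy, replication factor 1) and placeholder column names/types rather than my real schema:

# Sketch only: create a keyspace and table for local testing.
# Column names/types are placeholders, not the real schema.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # local Cassandra instance
session = cluster.connect()

# SimpleStrategy with replication_factor 1 is only sensible on a single local node
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS firmographics
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

session.execute("""
    CREATE TABLE IF NOT EXISTS firmographics.company (
        col1 text PRIMARY KEY,
        col2 text,
        col3 text
    )
""")

cluster.shutdown()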
Turns out, as Alex Ott said, it's easy enough to just write this up in Spark. Below is my code:
import time

import findspark
findspark.init()  # make the local Spark installation importable

from pyspark.sql import SparkSession

# Pull in the Spark Cassandra Connector when the session is created
spark = SparkSession\
    .builder\
    .appName("Spark Exploration App")\
    .config('spark.jars.packages', 'com.datastax.spark:spark-cassandra-connector_2.11:2.3.2')\
    .getOrCreate()

# Read the whole Parquet folder into a DataFrame
df = spark.read.parquet("/PATH/TO/FILE/")

# Append everything (minus one unneeded column) straight into Cassandra and time it
start = time.time()
df.drop('filename').write\
    .format("org.apache.spark.sql.cassandra")\
    .mode('append')\
    .options(table="few_com", keyspace="bmbr")\
    .save()
end = time.time()
print(end - start)
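To sanity-check the load afterwards, the same connector can read the table back, for example with a quick row count. This is just a sketch using the same keyspace/table names as above, and is not part of the timed load:

# Read the table back through the connector and count rows as a rough check
written = spark.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table="few_com", keyspace="bmbr")\
    .load()
print(written.count())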