Search code examples
pythonoracle-databaseazure-data-lake

Oracle Partitioned table to Datalake using python


Hello I'm trying to Oracle Partitioned table to Datalake parquet file.

Using this script

# Convert it to Spark SQL table and save it as parquet format

df.write \
    .format("parquet") \
    .option("path","/archive/" + schema_name + "/" + table_name + ".parquet") \
    .mode("append") \
    .saveAsTable(table_name)

This code get all data of table not partition.

spark = SparkSession.builder \
        .appName("Load " + schema_name + " " + table_name + " from Oracle into Parquet and creating Table") \
        .getOrCreate()

This one is Creating table from Oracle

How can i get this only parquet :)


Solution

  • Created table in oracle named Checkes Add partition name to table after I can read this partition name from Spark.

    
    query = '(select partition_name from Schema.checkes c) checkes'
    df = spark.read \
        .format("jdbc") \
        .option("url","jdbc:oracle:thin:@" + db_host + ":" + db_port + "/" + db_service) \
        .option("dbtable",query) \
        .option("user",db_user) \
        .option("password",db_pass) \
        .option("driver","oracle.jdbc.OracleDriver") \
        .option("encoding","UTF-8") \
        .option("fetchSize", 10000) \
        .option("numPartitions",40) \
        .load()
    
    print("part count: " + str(df.count()))
    if df.count() > 0:partition_name = df.select("partition_name").collect()[0]["partition_name"]
    
    df1 = spark.read \
        .format("jdbc") \
        .option("url","jdbc:oracle:thin:@" + db_host + ":" + db_port + "/" + db_service) \
        .option("dbtable",query1) \
        .option("user",db_user) \
        .option("password",db_pass) \
        .option("driver","oracle.jdbc.OracleDriver") \
        .option("encoding","UTF-8") \
        .option("fetchSize", 10000) \
        .option("numPartitions",40) \
        .load()
        ```