Tags: scala, apache-spark, pyspark, databricks, azure-databricks

How do I find the total time a DataFrame write to an ADLS path takes?


I write 100+ DataFrames in a loop. How do I log how long each DataFrame takes to write as a CSV to an ADLS path? I would like to store this information in a table so I can check which DataFrames need optimization.

Sample code to write a TSV to the Data Lake path:

dataFrame
      .repartition(1)
      .write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("sep", colDelim)
      .option("quoteAll", true)
      .option("escape", "\"")
      .mode("overwrite")
      .save(filePath + fileName)

Solution

  • You can use the PySpark code below to get the duration of each DataFrame write operation.

    This assumes you have a list of DataFrames.

    import time
    
    log = []
    
    for i, df in enumerate(dataframe_list):
        start_time = time.time()
        
        # write each DataFrame as a single CSV file; save() triggers the job,
        # so the elapsed wall-clock time covers the full write
        df.repartition(1).write.format("csv") \
            .option("header", "true") \
            .option("sep", " ") \
            .option("quoteAll", "true") \
            .option("escape", "\"") \
            .mode("overwrite") \
            .save("/mnt/csv/dataframe_" + str(i))
        
        end_time = time.time()
        duration = end_time - start_time
        
        # record the name and write duration (in seconds) for this DataFrame
        each_df_time = {
            "DF_name": "dataframe" + str(i),
            "time_taken": duration
        }
        log.append(each_df_time)
        
    log_df = spark.createDataFrame(log, schema="DF_name STRING, time_taken DOUBLE")
    display(log_df)
    

    Output:

    A table listing each DF_name with its time_taken in seconds.
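
    To keep this information in a table you can query later, as the question asks, the log DataFrame can be appended to a managed table instead of only being displayed. Below is a minimal sketch, assuming a Databricks/Spark environment; the table name write_duration_log is hypothetical. Sorting by time_taken then points directly at the DataFrames that may need optimization.

    from pyspark.sql import functions as F
    
    # add a timestamp so repeated runs can be compared over time
    log_with_ts = log_df.withColumn("logged_at", F.current_timestamp())
    
    # append the log to a managed table ("write_duration_log" is a hypothetical name)
    log_with_ts.write.mode("append").saveAsTable("write_duration_log")
    
    # later, list the slowest writes first
    spark.sql("""
        SELECT DF_name, time_taken
        FROM write_duration_log
        ORDER BY time_taken DESC
    """).show()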