Tags: python-3.x, pyspark, azure-databricks

How to write each dataframe partition into different tables


I am using Databricks to connect to an Event Hub, where each message coming from the Event Hub may be very different from the next.

In the message, I have a body and an id.

I am looking for performance, so I am avoiding collecting data or doing unnecessary processing; I also want to do the saving in parallel, per partition. However, I am not sure how to do this properly.

I want to append the body for each ID to a different AND SPECIFIC table, in batches; the ID tells me which table to save to. In order to do that, I have been trying two approaches:

  1. Partitioning: Repartition(numPartitions, ID) -> ForeachPartition
  2. Grouping: groupBy('ID').apply(myFunction) #@pandas_udf GROUPED_MAP

Approach 1 doesn't look very attractive to me: the repartition step seems unnecessary, and I saw in the docs that even if I set a column as the partition key, many ids of that column may end up in a single partition. It only guarantees that all data related to a given id is in the same partition and not split up.
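Roughly, approach 1 would look something like the sketch below. This is only a sketch under my assumptions: write_rows_to_table and table_for_id are hypothetical helpers standing in for my actual sink and ID-to-table mapping, and the SparkSession cannot be used inside foreachPartition.

    from collections import defaultdict

    def save_partition(rows):
        # rows is an iterator of Row objects for one partition;
        # group them by ID so each table gets a single batched append
        grouped = defaultdict(list)
        for row in rows:
            grouped[row["ID"]].append(row["body"])
        for id_value, bodies in grouped.items():
            # hypothetical helpers: map the ID to its table and append the batch
            write_rows_to_table(table_for_id(id_value), bodies)

    df.repartition(numPartitions, "ID").foreachPartition(save_partition)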

Approach 2 forces me to output, from the pandas_udf, a dataframe with the same schema as the input, which is not going to happen, since I am transforming the Event Hub message from CSV into a dataframe in order to save it to the table. I could return the same dataframe that I received, but that feels weird.
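For completeness, approach 2 looks roughly like this. It is only a sketch: save_to_table and table_for_id are hypothetical helpers, and I am assuming the incoming dataframe has just an ID and a body column.

    import pandas as pd
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    @pandas_udf("ID string, body string", PandasUDFType.GROUPED_MAP)
    def write_group(pdf):
        # pdf is a pandas DataFrame holding every row for one ID
        save_to_table(table_for_id(pdf["ID"].iloc[0]), pdf["body"])
        return pdf  # returning the input unchanged, as mentioned above

    result = df.groupBy("ID").apply(write_group)
    result.count()  # an action is still needed to actually trigger the writes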

Is there any nice approach I am not seeing?


Solution

  • If your Id column has a small number of distinct values (like a type/country column), you can use partitionBy when storing the data, so saving each group to a different table will be faster. Otherwise, create a derived column (using withColumn) from your Id column, applying the same logic you want to use to divide the data across tables. You can then use that derived column as the partition column in order to get faster loads.
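A minimal sketch of that idea, assuming a staging path on DBFS and that the derived column value can be used directly as the target table name (both are assumptions here, swap in your own mapping logic and paths):

    from pyspark.sql import functions as F

    # derive the partition column from ID; replace this expression with
    # whatever logic maps an ID to its target table
    partitioned = df.withColumn("target_table", F.col("ID"))

    # one partitioned write instead of many small ones
    (partitioned.write
        .mode("append")
        .partitionBy("target_table")
        .parquet("/mnt/staging/messages"))

    # each partition folder can then be appended to its own table
    tables = [r["target_table"] for r in
              partitioned.select("target_table").distinct().collect()]
    for name in tables:
        (spark.read
            .parquet("/mnt/staging/messages/target_table={}".format(name))
            .write
            .mode("append")
            .saveAsTable(name))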