I am using Databricks to connect to an EventHub, where each message coming from the EventHub may be very different from the others.
In each message, I have a body and an id.
I am looking for performance, so I am avoiding collecting data or doing unnecessary processing, and I also want to do the saving in parallel, by partition. However, I am not sure how to do this properly.
I want to append the body of each id to a different (and specific) table in batches; the id tells me which table to save to. In order to do that, I have been trying 2 approaches:
Approach 1 doesn't look very attractive to me: the repartition step seems unnecessary, and I saw in the docs that even if I set a column as the partition key, Spark may still put many ids of that column into a single partition. It only guarantees that all data related to a given id is in the same partition, not split across partitions.
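For reference, approach 1 as I understand it looks roughly like the sketch below. The `write_batch` sink and the `events_<id>` table-naming scheme are assumptions, not the real code; in Databricks the sink would append to the Delta table chosen by the id:

```python
from collections import defaultdict

def save_partition(rows, write_batch):
    """Group the rows of one Spark partition by id, then write each
    group once.  `write_batch(table_name, bodies)` is a hypothetical
    sink standing in for the real per-table append."""
    batches = defaultdict(list)
    for row in rows:
        batches[row["id"]].append(row["body"])
    for id_, bodies in batches.items():
        # assumed id -> table naming scheme
        write_batch(f"events_{id_}", bodies)

# Wiring in the streaming job would be roughly:
# df.repartition("id").rdd.foreachPartition(
#     lambda it: save_partition(it, write_batch))
```

Passing the sink in as a callable keeps the per-partition logic testable without a cluster.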
Approach 2 forces me to return, from the pandas_udf, a dataframe with the same schema as the input, which is not going to happen, since I am transforming the EventHub message from CSV into a dataframe in order to save it to the table. I could return the same dataframe that I received, but that feels weird.
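One note on the schema complaint: with `groupBy(...).applyInPandas(...)` (Spark 3+), the output schema is declared separately and does not have to match the input, so the group function can return a small status frame instead of echoing the rows back. A minimal sketch, with the actual table write elided and the CSV parsing simplified:

```python
import pandas as pd

def save_group(pdf: pd.DataFrame) -> pd.DataFrame:
    """Receives all rows of one id, parses the CSV bodies, would write
    them to that id's table (write elided here), and returns a tiny
    status frame -- a different schema from the input."""
    id_ = pdf["id"].iloc[0]
    # split each CSV body into columns; real parsing/saving is assumed
    parsed = pdf["body"].str.split(",", expand=True)
    # ... save `parsed` to the table chosen by `id_` ...
    return pd.DataFrame({"id": [id_], "rows_written": [len(parsed)]})

# Streaming wiring (sketch):
# df.groupBy("id").applyInPandas(
#     save_group, schema="id string, rows_written long")
```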
Is there any nicer approach I am not seeing?
If your id column has a small number of distinct values (like a type/country column), you can use partitionBy when storing, and saving the partitions to different tables will then be faster. Otherwise, create a derived column (using withColumn) from your id column, applying the same logic you want to use when dividing the data across tables. You can then use that derived column as the partition column in order to get a faster load.
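The derived-column idea can be sketched as below. The routing rule in `table_for` (prefix of the id) is purely an assumption standing in for your real id-to-table logic:

```python
def table_for(id_: str) -> str:
    """Assumed routing rule: map an id to its target table name.
    Replace with the same logic you use to divide data across tables."""
    return "events_" + id_.split("-")[0]

# Wiring (sketch): derive the routing column, then partition the write
# by it so each target table's rows land in separate files/directories.
#
# from pyspark.sql import functions as F
# route_udf = F.udf(table_for)
# routed = df.withColumn("target_table", route_udf("id"))
# routed.write.partitionBy("target_table").mode("append").save(base_path)
```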