I'd like to use adaptive query execution (AQE) to coalesce small partitions. However, in jobs that don't contain a shuffle (for example, reading data from one location and writing it out without any transformations), AQE never kicks in, since it only operates on shuffle boundaries.
So I need to force some kind of shuffle. What is the best way to do it, such that the shuffle is not costly?
Should I, for example, just join my dataframe against a single-row dataframe, or something along those lines?
Or is there a better way?
Maybe I should do something else entirely and not use AQE in this case. Let me know. Thanks.
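For context on what "cheap shuffle" options exist: a sketch of the approach, assuming Spark 3.x with AQE enabled. The paths and the column name `id` are placeholders. Note that AQE leaves a user-specified partition count (e.g. `repartition(200)`) alone, but it can coalesce a shuffle whose count comes from `spark.sql.shuffle.partitions`, so repartitioning by an expression without an explicit count gives AQE room to pick the final number.

```scala
// Sketch only, not a drop-in solution. Assumes Spark 3.x; paths and
// the column `id` are placeholders for your own data.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("aqe-coalesce-sketch")
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
  // Target size AQE aims for per post-shuffle partition
  .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128m")
  .getOrCreate()

val df = spark.read.parquet("/path/in")

// Repartition by a column with no explicit count: this introduces a
// shuffle whose partition count AQE is free to coalesce downward.
df.repartition(col("id"))
  .write.parquet("/path/out")

// On Spark 3.2+, the REBALANCE hint is built for exactly this
// small-files use case and is usually the cleaner choice:
// df.hint("rebalance").write.parquet("/path/out")
```

The join-with-one-row trick would also introduce a shuffle, but a broadcast join (the likely plan for a 1-row side) has no shuffle exchange at all, so it may not help.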
Looks like I need to write Spark session extensions (`SparkSessionExtensions`) for this. They let you inject your own rules into the Catalyst optimizer. I found some code here:
https://gist.github.com/GrigorievNick/2f77b26719e46c544e3f20aa48862719
And also this video on Databricks' YouTube channel: https://www.youtube.com/watch?v=IlovS-Y7KUk
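For anyone who doesn't want to dig through the links, here is a minimal sketch of the extension mechanism, assuming Spark 3.x. The rule body is a placeholder that returns the plan unchanged; the names `MyCoalesceRule`, `MyExtensions`, and `com.example` are made up for illustration, and the actual rewriting logic lives in the gist above.

```scala
// Minimal sketch of injecting a custom Catalyst optimizer rule via
// SparkSessionExtensions. The rule here is a no-op placeholder.
import org.apache.spark.sql.SparkSessionExtensions
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

case class MyCoalesceRule() extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    // Placeholder: inspect `plan` and rewrite it here
    // (e.g. insert a repartition over bare scan-to-write plans).
    plan
  }
}

class MyExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    extensions.injectOptimizerRule(session => MyCoalesceRule())
  }
}

// Register the extension either via config:
//   spark-submit --conf spark.sql.extensions=com.example.MyExtensions ...
// or programmatically when building the session:
//   SparkSession.builder().withExtensions(new MyExtensions).getOrCreate()
```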