I have a Dataset that is evenly divided by record count per partition, but some of the partitions have a data size that is 4 or more times larger than the others. Each record contains a collection, which I imagine can be much larger in some records. This causes what looks like data skew: some partitions take a lot longer to process because of these unbalanced records. If I could enable some logging in Spark to print the size in bytes of each partition being processed, and the size of each row, that would help me troubleshoot. This matters because the data is being sent to Cassandra using their Spark connector, which does some repartitioning of its own.
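Here is roughly the kind of check I had in mind (just a sketch; `df` stands for my Dataset, and SizeEstimator reports JVM object sizes, not the exact bytes the connector writes, so the numbers are only relative indicators):

import org.apache.spark.util.SizeEstimator

// For each partition, estimate the total in-memory size and the largest single row.
df.rdd.mapPartitionsWithIndex { (idx, rows) =>
  var count = 0L
  var total = 0L
  var maxRow = 0L
  rows.foreach { row =>
    val s = SizeEstimator.estimate(row)
    count += 1
    total += s
    if (s > maxRow) maxRow = s
  }
  Iterator((idx, count, total, maxRow))
}.collect().foreach { case (idx, count, total, maxRow) =>
  println(s"partition=$idx rows=$count approxBytes=$total largestRowBytes=$maxRow")
}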
There is no way to re-partition a dataset by size. In my case I had an array column for which some rows had a very large number of entries. This turned out to be an anomaly in the data, and I was able to drop those rows by simply adding a filter to the dataset.
import org.apache.spark.sql.functions.size  // also needs spark.implicits._ for the $"" syntax
df.filter(size($"colname") < 1000)
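Before adding the filter, it can be worth confirming the anomaly by looking at the distribution of array sizes; a quick sketch (column name is a placeholder):

import org.apache.spark.sql.functions.{size, max, avg}

// Compare the largest array against the average to see how extreme the outliers are.
df.select(max(size($"colname")).as("max_entries"),
          avg(size($"colname")).as("avg_entries")).show()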