I have a DataFrame partitioned by a column:
import org.apache.spark.sql.functions.col

val dfDL = spark.read.option("delimiter", ",")
  .option("header", true)
  .csv(file.getPath.toUri.getPath)
  .repartition(col("column_to"))
val structure = "schema_from" ::
  "table_from" ::
  "column_from" ::
  "link_type" ::
  "schema_to" ::
  "table_to" ::
  "column_to" :: Nil
How do I get a collection of arrays split by partition? That is, for each partition value I need a separate collection. For example, I need a method like this:
def getArrays(df: DataFrame): Iterator[Array[Row]] = { // or Iterator[List[Row]]
  ???
}
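A sketch of one way to implement this, assuming the DataFrame has already been repartitioned by the target column as above. Note that repartition(col("column_to")) hash-partitions the rows, so a single physical partition may contain several distinct values and some partitions may be empty:

import org.apache.spark.sql.{DataFrame, Row}

def getArrays(df: DataFrame): Iterator[Array[Row]] =
  df.rdd
    .mapPartitions(rows => Iterator(rows.toArray)) // one array per physical partition
    .toLocalIterator                               // stream partitions to the driver one at a time
    .filter(_.nonEmpty)                            // drop partitions left empty by hashing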
All values for the partition column:
val allTargetCol = df.select(col("column_to")).distinct().collect().map(_.getString(0))
If you know the partition values, you can iterate over them, calling filter and then collect for each one.
Pseudo code (PySpark):

from pyspark.sql import functions as f

partitions = []
for partition_value in partition_values_list:
    # Collect all rows whose partition column matches this value
    partitions.append(df.filter(f.col('partition_column') == partition_value).collect())
Otherwise, you first need to build a list/array of the distinct partition values and then apply the same step to each one, as in the Scala sketch below.
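The same approach in Scala, using the question's column_to as the partition column (a sketch; each collect() pulls one group's rows onto the driver, so every group must fit in driver memory):

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col

// Distinct partition values, collected to the driver
val partitionValues: Array[String] =
  df.select(col("column_to")).distinct().collect().map(_.getString(0))

// One Array[Row] per partition value
val partitions: Array[Array[Row]] =
  partitionValues.map(value => df.filter(col("column_to") === value).collect())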