
Collect collections by partitions from DataFrame


I have a DataFrame partitioned by a column:

val dfDL = spark.read.option("delimiter", ",")
                     .option("header", true)
                     .csv(file.getPath.toUri.getPath)
                     .repartition(col("column_to"))

val structure =  "schema_from" ::
                 "table_from"  ::
                 "column_from" ::
                 "link_type"   ::
                 "schema_to"   ::
                 "table_to"    ::
                 "column_to"   :: Nil

How do I get a collection of arrays, one per partition? That is, for each partition I need a separate collection. For example, I need a method like this:

def getArrays(df: DataFrame): Iterator[Array[Row]] = { // or Iterator[List[Row]]
   ???
}

All values of the partition column:

val allTargetCol = df.select(col("column_to")).distinct().collect().map(_.getString(0))

Solution

  • If you know the partition values, you can iterate over each partition value, call filter and then collect.

    pseudo code

    partitions = []
    
    for partition_value in partition_values_list:
      partitions.append(df.filter(f.col('partition_column') == partition_value).collect())
    

    Otherwise, you first need to build a list/array of distinct partition values and then repeat the step above; a Scala sketch covering both steps follows.
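
    A runnable Scala sketch of this approach, assuming the partitioning column is "column_to" as in the question and that collecting each partition's rows to the driver is acceptable:

    import org.apache.spark.sql.{DataFrame, Row}
    import org.apache.spark.sql.functions.col

    def getArrays(df: DataFrame): Iterator[Array[Row]] = {
      // Collect the distinct partition values on the driver
      val partitionValues = df.select(col("column_to"))
                              .distinct()
                              .collect()
                              .map(_.getString(0))

      // For each value, filter the DataFrame and collect its rows.
      // The iterator is lazy, so each partition is collected on demand.
      partitionValues.iterator.map { value =>
        df.filter(col("column_to") === value).collect()
      }
    }

    Each element of the iterator is the Array[Row] for one value of column_to. Note that every element triggers a separate Spark job and pulls that partition's rows to the driver, so this is only practical when each partition fits in driver memory.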