Tags: arrays, apache-spark, apache-spark-sql, transformation, large-data

Spark Dataset transformation to array


I have a dataset like the one below, with values of col1 repeating multiple times and unique values of col2. The original dataset can have almost a billion rows, so I do not want to use collect or collect_list, as they will not scale out for my use case.

Original Dataset:

+----+----+
|col1|col2|
+----+----+
|  AA|  11|
|  BB|  21|
|  AA|  12|
|  AA|  13|
|  BB|  22|
|  CC|  33|
+----+----+

I want to transform the dataset into the following format, with newColumn as an array of the col2 values.

Transformed Dataset:

+----+------------+
|col1|   newColumn|
+----+------------+
|  AA|[11, 12, 13]|
|  BB|    [21, 22]|
|  CC|        [33]|
+----+------------+
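
For reference, here is a minimal snippet that reproduces the sample input above (assuming an existing SparkSession; the column names and values are taken from the tables):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Small sample mirroring the original dataset
    df = spark.createDataFrame(
        [("AA", 11), ("BB", 21), ("AA", 12), ("AA", 13), ("BB", 22), ("CC", 33)],
        ["col1", "col2"],
    )
    df.show()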

I have seen this solution, but it uses collect_list and will not scale out on big datasets.


Solution

  • Using Spark's built-in functions is always the best way. I see no problem in using the collect_list function; as long as you have sufficient memory, this is the best approach. One way of optimizing your job is to save your data as Parquet, bucket it by col1, and save it as a table. Better still, also partition it by some column that distributes the data evenly.

    For example,

    from pyspark.sql import functions as F

    # database_name, part_col, savepath, and tablename are placeholders for your own values
    df_stored = ...  # load your data from CSV, Parquet, or any other format
    spark.catalog.setCurrentDatabase(database_name)
    df_stored.write.mode("overwrite").format("parquet") \
        .partitionBy(part_col).bucketBy(10, "col1") \
        .option("path", savepath).saveAsTable(tablename)
    df_analysis = spark.table(tablename)
    df_aggreg = df_analysis.groupby("col1").agg(F.collect_list(F.col("col2")))
    

    This speeds up the aggregation and avoids a lot of shuffling. Try it out.
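
    As a quick sanity check on the small sample, the aggregated column can be aliased to match the desired output (the newColumn alias below is just my own choice) and inspected:

    df_aggreg = df_analysis.groupby("col1").agg(
        F.collect_list(F.col("col2")).alias("newColumn")
    )
    df_aggreg.show(truncate=False)
    # Expected output (order inside the arrays is not guaranteed):
    # +----+------------+
    # |col1|newColumn   |
    # +----+------------+
    # |AA  |[11, 12, 13]|
    # |BB  |[21, 22]    |
    # |CC  |[33]        |
    # +----+------------+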