Search code examples
javaapache-sparkapache-spark-sqljavadoc

How to explode an array into multiple columns in Spark Java


I have the specific Dataframe

+-------+----------------------+
|BlockId|Entity_BlockNumberList|
+-------+----------------------+
|      1|      [[1, 4], [3, 5]]|
|      2|      [[1, 4], [3, 5]]|
|      3|      [[2, 4], [4, 4]]|
|      4|      [[2, 4], [4, 4]]|
|      5|      [[2, 4], [3, 5]]|
|      6|      [[3, 5], [5, 3]]|
|      7|      [[1, 4], [4, 4]]|
|      8|  [[3, 5], [4, 4], ...|
|      9|  [[1, 4], [2, 4], ...|
+-------+----------------------+

I want to create multiple columns for each element of the nested arrays in the 2nd Column Something like that

BlockId | Entity_BlockNumberList | 1st Array | 2nd Array | ...

Something like explode but in Columns by using Javadoc I have found this code on the internet

val numCols = df
  .withColumn("letters_size", size($"letters"))
  .agg(max($"letters_size"))
  .head()
  .getInt(0)

df
  .select(
    (0 until numCols).map(i => $"letters".getItem(i).as(s"col$i")): _*
  )
  .show()

which is in scala but i cant quite get how to do the select func especially the (0 until numCols) by using Java.

example: input:

+---------+
|  letters|
+---------+
|[a, b, c]|
|[d, e, f]|
|     null|
+---------+

expected output:

+----+----+----+
|col0|col1|col2|
+----+----+----+
|   a|   b|   c|
|   d|   e|   f|
|null|null|null|
+----+----+----+

But not hardcoded because my arrays dont have the same length.

I tried that

df.selectExpr(df.select(
                expr("concat('struct(',concat_ws(',',transform(sequence(0,max(size(entities))-1),x -> concat('entities[',x,'] as col',x))),') as columns')")
                ).as(Encoders.STRING()).head()).show();

The output was the follow:

+------------+
|     columns|
+------------+
|    [1, 3,,]|
|    [1, 3,,]|
|    [2, 4,,]|
|    [2, 4,,]|
|    [2, 3,,]|
|    [3, 5,,]|
|    [1, 4,,]|
|[3, 4, 5, 6]|
|  [1, 2, 5,]|
+------------+

Solution

  • The best I could do to achieve what I wanted that works and in nested arrays is do a for like that.

    for (int i; i < numCols; i++) {
        df = df.withColumn("c" + i, df.col("entities").getItem(i));
    }
    

    The output:

    +-----+------------+---+---+----+----+
    |block|    entities| c0| c1|  c2|  c3|
    +-----+------------+---+---+----+----+
    |    1|      [1, 3]|  1|  3|null|null|
    |    2|      [1, 3]|  1|  3|null|null|
    |    3|      [2, 4]|  2|  4|null|null|
    |    4|      [2, 4]|  2|  4|null|null|
    |    5|      [2, 3]|  2|  3|null|null|
    |    6|      [3, 5]|  3|  5|null|null|
    |    7|      [1, 4]|  1|  4|null|null|
    |    8|[3, 4, 5, 6]|  3|  4|   5|   6|
    |    9|   [1, 2, 5]|  1|  2|   5|null|
    +-----+------------+---+---+----+----+
    

    Kinda worked for what i wanted to do next.