scala, apache-spark, apache-spark-sql

Spark dataframe pivot without aggregation


I am new to Spark, and I am currently trying to pivot rows to columns without aggregation; that is, I need each set of rows to remain its own record after the pivot.

I have data like the below:

+----+----------+--------+
|abc |col       |position|
+----+----------+--------+
|1234|183500000X|0       |
|1234|0202211120|1       |
|1234|VA        |2       |
|1234|Y         |3       |
|1234|183500000X|0       |
|1234|21174     |1       |
|1234|NC        |2       |
|1234|N         |3       |
|1234|null      |0       |
|1234|null      |1       |
|1234|          |2       |
|1234|          |3       |
|1234|null      |0       |
|1234|null      |1       |
|1234|          |2       |
|1234|          |3       |
+----+----------+--------+

I would like to change it to the format below:

+----------+----------+----------+---+---+
|   abc    |         0|         1|  2|  3|
+----------+----------+----------+---+---+
|1234      |183500000X|0202211120| VA|  Y|
|1234      |183500000X|21174     | NC|  N|
+----------+----------+----------+---+---+

Whenever I try df.groupBy($"abc").pivot("position").agg(first($"col")), I get only one record instead of all of them.

Is there a way to get all the records without aggregation?

Do I need to join with another dataframe to pull the data out? Any suggestions would be appreciated.


Solution

    import org.apache.spark.sql.functions._
    import spark.implicits._ // required for .toDF on an RDD of tuples

    val df = spark.sparkContext.parallelize(Seq(
            ("A", "1", "0", 1),
            ("A", "2", "1", 1),
            ("A", "VA", "2", 1),
            ("A", "7", "3", 1),
            ("A", "11", "0", 2),
            ("A", "22", "1", 2),
            ("A", "VAVA", "2", 2),
            ("A", "77", "3", 2),
            ("B", "1", "0", 3),
            ("B", null, "1", 3)
          )).toDF("abc", "col", "position", "grouping")

    // Pivot within each (abc, grouping) pair so every logical record
    // survives as its own output row, then drop the helper column.
    val result = df.groupBy("abc", "grouping")
                   .pivot("position")
                   .agg(expr("first(col)"))
                   .drop("grouping")

    result.show()
    
    +---+---+----+----+----+
    |abc|  0|   1|   2|   3|
    +---+---+----+----+----+
    |  A|  1|   2|  VA|   7|
    |  A| 11|  22|VAVA|  77|
    |  B|  1|null|null|null|
    +---+---+----+----+----+
    

Leaving the nulls aside, you need to add a grouping number to your data in some way, as the grouping column does above; that is the key. Deriving it is a data wrangling exercise, and I do not know your data well enough to offer concrete advice.
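Purely as an illustration, here is a minimal sketch of one way to derive such a grouping number from the questioner's original three-column DataFrame (called rawDf here, a hypothetical name), assuming the rows arrive in the intended order and every logical record starts at position 0; neither assumption is guaranteed by the data shown:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    // Freeze the current row order with a monotonically increasing id.
    // Note: ids follow partition order, so coalesce to a single partition
    // first if the exact input order must be preserved.
    val withId = rawDf.coalesce(1)
                      .withColumn("row_id", monotonically_increasing_id())

    // Start a new group each time position wraps back to 0: a running sum
    // of "is this a position-0 row?" yields 1,1,1,1, 2,2,2,2, ...
    // (position is assumed numeric here; compare against "0" if it is a string)
    val w = Window.orderBy("row_id")
    val grouped = withId.withColumn(
      "grouping",
      sum(when(col("position") === 0, 1).otherwise(0)).over(w)
    )

The grouped DataFrame can then feed the same groupBy("abc", "grouping").pivot("position") shown above.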

Supporting Notes

The issue is that the grouping looks to be sequential, and we do not know whether the rows always come in sets of 4 or of some arbitrary N. How do we apply such a grouping? You normally need a proper set of grouping keys, but we do not appear to have one here. Spark is less good at this type of positional work, since we need to preserve row order; even with zipWithIndex it is a hard task (a rough sketch follows).
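For what it is worth, a rough sketch of the zipWithIndex route, under the unverified assumptions that all three columns are strings and that records always come in sets of exactly 4 (rawDf is again the questioner's hypothetical three-column DataFrame):

    import spark.implicits._

    // zipWithIndex gives each row a stable 0-based index; integer division
    // by 4 then buckets every 4 consecutive rows into one group.
    val grouped = rawDf.rdd
      .zipWithIndex()
      .map { case (row, idx) =>
        (row.getString(0), row.getString(1), row.getString(2), idx / 4)
      }
      .toDF("abc", "col", "position", "grouping")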

In fact, the issue is more this than the .pivot itself.