java, apache-spark, pyspark, apache-spark-sql

Transformation with Spark


I have the following data table. It has over a million rows, but there are no more than 20 distinct values of attr_id / attr_name:

object_id                             attr_id                               attr_name  value
7315dcbf-6178-4aa7-97ba-8875e2542471  f5af0127-fd4c-409b-afda-b896a80f7fe1  name       Oleg
7315dcbf-6178-4aa7-97ba-8875e2542471  ebdd03e5-a380-46bd-b505-f1f166be6409  age        22
7315dcbf-6178-4aa7-97ba-8875e2542471  735fab29-cde8-4b34-979f-864b375f820f  number     22222
50b9d918-b0ab-41d4-957c-24f506ceb737  f5af0127-fd4c-409b-afda-b896a80f7fe1  name       Max
50b9d918-b0ab-41d4-957c-24f506ceb737  ebdd03e5-a380-46bd-b505-f1f166be6409  age
50b9d918-b0ab-41d4-957c-24f506ceb737  735fab29-cde8-4b34-979f-864b375f820f  number     32222

I want to transform it to this format

object_id                             name  age  number
7315dcbf-6178-4aa7-97ba-8875e2542471  Oleg  22   22222
50b9d918-b0ab-41d4-957c-24f506ceb737  Max         32222

Is it possible to do this with Spark? If so, how, or what material should I read?


Solution

  • You can simply pivot the dataframe as follows:

    from pyspark.sql import functions as F

    # Group by object_id, turn each distinct attr_name into its own column,
    # and take the first value found for each (object_id, attr_name) pair
    dataframe.groupBy("object_id")\
             .pivot("attr_name")\
             .agg(F.first(F.col("value")))\
             .show(truncate=False)
    
    +------------------------------------+----+----+------+
    |object_id                           |age |name|number|
    +------------------------------------+----+----+------+
    |50b9d918-b0ab-41d4-957c-24f506ceb737|null|Max |32222 |
    |7315dcbf-6178-4aa7-97ba-8875e2542471|22  |Oleg|22222 |
    +------------------------------------+----+----+------+
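
  • Since the question says there are at most 20 distinct attribute names, you can also pass the pivot values explicitly, which lets Spark skip the extra job it otherwise runs to discover them. Below is a minimal, self-contained sketch; the SparkSession setup, the recreated sample rows, and the column list ["name", "age", "number"] are assumptions taken from the example data above, not part of the original answer:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Recreate the sample rows from the question (Max's age is missing)
    rows = [
        ("7315dcbf-6178-4aa7-97ba-8875e2542471", "f5af0127-fd4c-409b-afda-b896a80f7fe1", "name", "Oleg"),
        ("7315dcbf-6178-4aa7-97ba-8875e2542471", "ebdd03e5-a380-46bd-b505-f1f166be6409", "age", "22"),
        ("7315dcbf-6178-4aa7-97ba-8875e2542471", "735fab29-cde8-4b34-979f-864b375f820f", "number", "22222"),
        ("50b9d918-b0ab-41d4-957c-24f506ceb737", "f5af0127-fd4c-409b-afda-b896a80f7fe1", "name", "Max"),
        ("50b9d918-b0ab-41d4-957c-24f506ceb737", "ebdd03e5-a380-46bd-b505-f1f166be6409", "age", None),
        ("50b9d918-b0ab-41d4-957c-24f506ceb737", "735fab29-cde8-4b34-979f-864b375f820f", "number", "32222"),
    ]
    dataframe = spark.createDataFrame(rows, ["object_id", "attr_id", "attr_name", "value"])

    # Listing the pivot values up front avoids the extra pass Spark
    # would otherwise need to collect the distinct attr_name values.
    pivoted = (dataframe.groupBy("object_id")
               .pivot("attr_name", ["name", "age", "number"])
               .agg(F.first("value")))
    pivoted.show(truncate=False)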