java, apache-spark, pyspark, apache-spark-sql

Transformation with Spark


I have the following data table. It has over a million rows, but there are no more than 20 distinct values of attr_id / attr_name:

object_id                             attr_id                               attr_name  value
7315dcbf-6178-4aa7-97ba-8875e2542471  f5af0127-fd4c-409b-afda-b896a80f7fe1  name       Oleg
7315dcbf-6178-4aa7-97ba-8875e2542471  ebdd03e5-a380-46bd-b505-f1f166be6409  age        22
7315dcbf-6178-4aa7-97ba-8875e2542471  735fab29-cde8-4b34-979f-864b375f820f  number     22222
50b9d918-b0ab-41d4-957c-24f506ceb737  f5af0127-fd4c-409b-afda-b896a80f7fe1  name       Max
50b9d918-b0ab-41d4-957c-24f506ceb737  ebdd03e5-a380-46bd-b505-f1f166be6409  age
50b9d918-b0ab-41d4-957c-24f506ceb737  735fab29-cde8-4b34-979f-864b375f820f  number     32222

I want to transform it to this format

object_id                             name  age  number
7315dcbf-6178-4aa7-97ba-8875e2542471  Oleg  22   22222
50b9d918-b0ab-41d4-957c-24f506ceb737  Max         32222

Is it possible to do this with Spark? If so, how, or what material should I read?


Solution

  • You can simply pivot the dataframe as follows:

    from pyspark.sql import functions as F

    # Group by object_id, turn each distinct attr_name into its own column,
    # and take the first value found for each (object_id, attr_name) pair
    dataframe.groupBy("object_id")\
             .pivot("attr_name")\
             .agg(F.first(F.col("value")))\
             .show(truncate=False)
    
    +------------------------------------+----+----+------+
    |object_id                           |age |name|number|
    +------------------------------------+----+----+------+
    |50b9d918-b0ab-41d4-957c-24f506ceb737|null|Max |32222 |
    |7315dcbf-6178-4aa7-97ba-8875e2542471|22  |Oleg|22222 |
    +------------------------------------+----+----+------+
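
  • Since the question says there are at most 20 distinct attribute names, you can also pass the pivot values explicitly, which lets Spark skip the extra job it otherwise runs to discover them. Below is a minimal, self-contained sketch; the SparkSession setup, the recreated sample rows, and the column list ["name", "age", "number"] are assumptions taken from the example data above, not part of the original answer:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Recreate the sample rows from the question (Max's age is missing)
    rows = [
        ("7315dcbf-6178-4aa7-97ba-8875e2542471", "f5af0127-fd4c-409b-afda-b896a80f7fe1", "name", "Oleg"),
        ("7315dcbf-6178-4aa7-97ba-8875e2542471", "ebdd03e5-a380-46bd-b505-f1f166be6409", "age", "22"),
        ("7315dcbf-6178-4aa7-97ba-8875e2542471", "735fab29-cde8-4b34-979f-864b375f820f", "number", "22222"),
        ("50b9d918-b0ab-41d4-957c-24f506ceb737", "f5af0127-fd4c-409b-afda-b896a80f7fe1", "name", "Max"),
        ("50b9d918-b0ab-41d4-957c-24f506ceb737", "ebdd03e5-a380-46bd-b505-f1f166be6409", "age", None),
        ("50b9d918-b0ab-41d4-957c-24f506ceb737", "735fab29-cde8-4b34-979f-864b375f820f", "number", "32222"),
    ]
    dataframe = spark.createDataFrame(rows, ["object_id", "attr_id", "attr_name", "value"])

    # Listing the pivot values up front avoids the extra pass Spark
    # would otherwise need to collect the distinct attr_name values.
    pivoted = (dataframe.groupBy("object_id")
               .pivot("attr_name", ["name", "age", "number"])
               .agg(F.first("value")))
    pivoted.show(truncate=False)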