I have a data table where the number of records exceeds a million rows; the unique attr_id and attr_name values number no more than 20:
object_id | attr_id | attr_name | value |
---|---|---|---|
7315dcbf-6178-4aa7-97ba-8875e2542471 | f5af0127-fd4c-409b-afda-b896a80f7fe1 | name | Oleg |
7315dcbf-6178-4aa7-97ba-8875e2542471 | ebdd03e5-a380-46bd-b505-f1f166be6409 | age | 22 |
7315dcbf-6178-4aa7-97ba-8875e2542471 | 735fab29-cde8-4b34-979f-864b375f820f | number | 22222 |
50b9d918-b0ab-41d4-957c-24f506ceb737 | f5af0127-fd4c-409b-afda-b896a80f7fe1 | name | Max |
50b9d918-b0ab-41d4-957c-24f506ceb737 | ebdd03e5-a380-46bd-b505-f1f166be6409 | age | |
50b9d918-b0ab-41d4-957c-24f506ceb737 | 735fab29-cde8-4b34-979f-864b375f820f | number | 32222 |
I want to transform it to this format
object_id | name | age | number |
---|---|---|---|
7315dcbf-6178-4aa7-97ba-8875e2542471 | Oleg | 22 | 22222 |
50b9d918-b0ab-41d4-957c-24f506ceb737 | Max | | 32222 |
Is it possible to do this using Spark? If so, how, or what material should I read?
You can simply pivot the dataframe as follows:
from pyspark.sql import functions as F

# Group by object_id, turn each distinct attr_name into its own column,
# and take the first value seen for each (object_id, attr_name) pair.
dataframe.groupBy("object_id") \
    .pivot("attr_name") \
    .agg(F.first(F.col("value"))) \
    .show(truncate=False)
+------------------------------------+----+----+------+
|object_id |age |name|number|
+------------------------------------+----+----+------+
|50b9d918-b0ab-41d4-957c-24f506ceb737|null|Max |32222 |
|7315dcbf-6178-4aa7-97ba-8875e2542471|22 |Oleg|22222 |
+------------------------------------+----+----+------+
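Since you mention there are at most 20 distinct attr_name values, you can also pass them to pivot() explicitly. When the values are given up front, Spark skips the extra job it would otherwise run just to collect the distinct pivot values, which matters on a million-row table. A minimal sketch, assuming the three attribute names from your example (collect them once from the data instead if the full list isn't known ahead of time):

# Supplying the pivot values avoids an extra pass over the data
# to discover the distinct attr_name values.
attr_names = ["name", "age", "number"]

pivoted = dataframe.groupBy("object_id") \
    .pivot("attr_name", attr_names) \
    .agg(F.first(F.col("value")))

As for reading material, the pyspark.sql.GroupedData.pivot entry in the official PySpark API documentation covers both forms.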