I have data that looks as follows:
| OmgevingID | AdministratieKantoorID | WerkgeverID | AdministratieID | JaarID | VolgnummerRun | PersoneelsnummerVerloning | Periode | Component | Tabel | Datum |
|------------|------------------------|-------------|-----------------|--------|---------------|---------------------------|---------|-----------|-------|-----------|
| 28 | 1 | 19 | 243 | 2022 | 34 | 33 | 6 | 70 | 5557.0| 2023-09-20|
| 28 | 1 | 19 | 243 | 2022 | 34 | 33 | 6 | 260 | 3990.0| 2023-09-20|
| 28 | 1 | 19 | 243 | 2022 | 34 | 40 | 6 | 70 | 2610.0| 2023-09-20|
| 28 | 1 | 19 | 243 | 2022 | 34 | 40 | 6 | 260 | 2475.0| 2023-09-20|
Now, I want to transform it so that it looks like this, with the a seperate column for each component of choice. For the values I want the 'tabel' column. Below you'll find an example of expected output.
| OmgevingID | AdministratieKantoorID | WerkgeverID | AdministratieID | JaarID | VolgnummerRun | PersoneelsnummerVerloning | Periode | Component_70_tabel | Component_260_tabel | Datum |
|------------|------------------------|-------------|-----------------|--------|---------------|---------------------------|---------|--------------------|---------------|-----------|
| 28 | 1 | 19 | 243 | 2022 | 34 | 33 | 6 | 5577.0 | 3990.0 | 2023-09-20|
| 28 | 1 | 19 | 243 | 2022 | 34 | 40 | 6 | 2610.0 | 2475.0 | 2023-09-20|
I have achieved this by using the following query:
SELECT
OmgevingID,
AdministratieKantoorID,
WerkgeverID,
AdministratieID,
JaarID,
VolgnummerRun,
PersoneelsnummerVerloning,
Periode,
MAX(CASE WHEN Component = 70 THEN Tabel END) AS Component_70_tabel,
MAX(CASE WHEN Component = 260 THEN Tabel END) AS Component_260_tabel,
Datum
FROM
YourTableName
GROUP BY
OmgevingID,
AdministratieKantoorID,
WerkgeverID,
AdministratieID,
JaarID,
VolgnummerRun,
PersoneelsnummerVerloning,
Periode,
Datum;
But I'm looking for an easier, more intuitive way, perhaps a PIVOT? For starters, the MAX(CASE WHEN...) syntax is a bit strange because the Combination of OmgevingID, AdministratieKantoorID, WerkgeverID, JaarID, VolgnummerRun, PersoneelsnummerVerloning, Periode and Component fields is a composite primary key. So there can only ever be ONE tabel value per group, which makes the MAX feel a bit strange. I am using spark-sql, but if there is a clean way in pyspark this is also optional.
In pyspark we can use groupby
+ pivot
to reshape the dataframe
keys = ['OmgevingID', 'AdministratieKantoorID', 'WerkgeverID',
'AdministratieID', 'JaarID', 'VolgnummerRun',
'PersoneelsnummerVerloning', 'Periode', 'Datum']
df1 = (
df
.withColumn('Component', F.expr("'Component_' || Component || '_Tabel'"))
.groupBy(keys)
.pivot('Component')
.agg(F.max('Tabel'))
)
df1.show()
+----------+----------------------+-----------+---------------+------+-------------+-------------------------+-------+----------+-------------------+------------------+
|OmgevingID|AdministratieKantoorID|WerkgeverID|AdministratieID|JaarID|VolgnummerRun|PersoneelsnummerVerloning|Periode| Datum|Component_260_Tabel|Component_70_Tabel|
+----------+----------------------+-----------+---------------+------+-------------+-------------------------+-------+----------+-------------------+------------------+
| 28| 1| 19| 243| 2022| 34| 33| 6|2023-09-20| 3990.0| 5557.0|
| 28| 1| 19| 243| 2022| 34| 40| 6|2023-09-20| 2475.0| 2610.0|
+----------+----------------------+-----------+---------------+------+-------------+-------------------------+-------+----------+-------------------+------------------+