Tags: sql, pyspark, group-by, apache-spark-sql, pivot

Difference between PIVOT and GROUP BY in SQL


I have data that looks as follows:

| OmgevingID | AdministratieKantoorID | WerkgeverID | AdministratieID | JaarID | VolgnummerRun | PersoneelsnummerVerloning | Periode | Component | Tabel | Datum     | 
|------------|------------------------|-------------|-----------------|--------|---------------|---------------------------|---------|-----------|-------|-----------|
| 28         | 1                      | 19          | 243             | 2022   | 34            | 33                        | 6       | 70        | 5557.0| 2023-09-20|
| 28         | 1                      | 19          | 243             | 2022   | 34            | 33                        | 6       | 260       | 3990.0| 2023-09-20|
| 28         | 1                      | 19          | 243             | 2022   | 34            | 40                        | 6       | 70        | 2610.0| 2023-09-20|
| 28         | 1                      | 19          | 243             | 2022   | 34            | 40                        | 6       | 260       | 2475.0| 2023-09-20|
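
For reference, here is a minimal sketch that rebuilds this sample as a PySpark DataFrame (assuming an active SparkSession named spark; the variable df is carried through the rest of this post):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Sample rows copied from the table above
    df = spark.createDataFrame(
        [
            (28, 1, 19, 243, 2022, 34, 33, 6, 70, 5557.0, '2023-09-20'),
            (28, 1, 19, 243, 2022, 34, 33, 6, 260, 3990.0, '2023-09-20'),
            (28, 1, 19, 243, 2022, 34, 40, 6, 70, 2610.0, '2023-09-20'),
            (28, 1, 19, 243, 2022, 34, 40, 6, 260, 2475.0, '2023-09-20'),
        ],
        ['OmgevingID', 'AdministratieKantoorID', 'WerkgeverID', 'AdministratieID',
         'JaarID', 'VolgnummerRun', 'PersoneelsnummerVerloning', 'Periode',
         'Component', 'Tabel', 'Datum'],
    )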

Now, I want to transform it so that it looks like this, with a separate column for each component of choice. For the values I want the 'Tabel' column. Below you'll find an example of the expected output.

| OmgevingID | AdministratieKantoorID | WerkgeverID | AdministratieID | JaarID | VolgnummerRun | PersoneelsnummerVerloning | Periode | Component_70_tabel | Component_260_tabel | Datum      |
|------------|------------------------|-------------|-----------------|--------|---------------|---------------------------|---------|--------------------|---------------------|------------|
| 28         | 1                      | 19          | 243             | 2022   | 34            | 33                        | 6       | 5557.0             | 3990.0              | 2023-09-20 |
| 28         | 1                      | 19          | 243             | 2022   | 34            | 40                        | 6       | 2610.0             | 2475.0              | 2023-09-20 |

I have achieved this by using the following query:

SELECT
    OmgevingID,
    AdministratieKantoorID,
    WerkgeverID,
    AdministratieID,
    JaarID,
    VolgnummerRun,
    PersoneelsnummerVerloning,
    Periode,
    -- exactly one row matches per group, so MAX simply returns that single Tabel value
    MAX(CASE WHEN Component = 70  THEN Tabel END) AS Component_70_tabel,
    MAX(CASE WHEN Component = 260 THEN Tabel END) AS Component_260_tabel,
    Datum
FROM
    YourTableName
GROUP BY
    OmgevingID,
    AdministratieKantoorID,
    WerkgeverID,
    AdministratieID,
    JaarID,
    VolgnummerRun,
    PersoneelsnummerVerloning,
    Periode,
    Datum;

But I'm looking for an easier, more intuitive way, perhaps a PIVOT? For starters, the MAX(CASE WHEN ...) syntax feels a bit odd, because the combination of the OmgevingID, AdministratieKantoorID, WerkgeverID, JaarID, VolgnummerRun, PersoneelsnummerVerloning, Periode and Component fields is a composite primary key. There can only ever be ONE Tabel value per group, which makes the MAX feel out of place. I am using spark-sql, but a clean way in pyspark would also be fine.


Solution

  • In PySpark we can use groupBy + pivot to reshape the DataFrame:

    from pyspark.sql import functions as F

    # Grouping keys: every column except the pivot column (Component)
    # and the value column (Tabel)
    keys = ['OmgevingID', 'AdministratieKantoorID', 'WerkgeverID',
            'AdministratieID', 'JaarID', 'VolgnummerRun',
            'PersoneelsnummerVerloning', 'Periode', 'Datum']

    df1 = (
        df
        # Pre-build the target column names, e.g. 70 -> 'Component_70_Tabel'
        .withColumn('Component', F.expr("'Component_' || Component || '_Tabel'"))
        .groupBy(keys)
        .pivot('Component')   # one output column per distinct Component value
        .agg(F.max('Tabel'))  # each group holds a single row, so max() just selects it
    )
    

    df1.show()
    
    +----------+----------------------+-----------+---------------+------+-------------+-------------------------+-------+----------+-------------------+------------------+
    |OmgevingID|AdministratieKantoorID|WerkgeverID|AdministratieID|JaarID|VolgnummerRun|PersoneelsnummerVerloning|Periode|     Datum|Component_260_Tabel|Component_70_Tabel|
    +----------+----------------------+-----------+---------------+------+-------------+-------------------------+-------+----------+-------------------+------------------+
    |        28|                     1|         19|            243|  2022|           34|                       33|      6|2023-09-20|             3990.0|            5557.0|
    |        28|                     1|         19|            243|  2022|           34|                       40|      6|2023-09-20|             2475.0|            2610.0|
    +----------+----------------------+-----------+---------------+------+-------------+-------------------------+-------+----------+-------------------+------------------+
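
    Since the composite key guarantees exactly one row per group and component, F.first('Tabel') works just as well as F.max('Tabel') as the pivot aggregate. And if the set of components is known up front, passing it explicitly as .pivot('Component', ['Component_70_Tabel', 'Component_260_Tabel']) spares Spark an extra pass over the data to collect the distinct pivot values.

  • Since the question mentions spark-sql: Spark SQL (2.4+) also has a native PIVOT clause. Below is a sketch of the same reshape, assuming the data has been registered as a temp view named verloning (a hypothetical name). Every column that appears neither as the pivot column nor inside the aggregate becomes an implicit grouping key, so no GROUP BY is needed:

    df.createOrReplaceTempView('verloning')  # hypothetical view name

    df2 = spark.sql("""
        SELECT *
        FROM verloning
        PIVOT (
            MAX(Tabel)
            FOR Component IN (70 AS Component_70_Tabel, 260 AS Component_260_Tabel)
        )
    """)
    df2.show()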