Search code examples
scalaapache-sparkapache-spark-sqldatastaxdatabricks

Spark-sql Pivoting not working as expected at least for taken bulk data


Pivoting dont work properly majority of the time , i.e. increase the source table records.


source_df
+---------------+-------------------+--------------------+-------------------+-------------------+--------------+-----------------------+----------------------+-----------+--------------+-------------------+----------------+---------------+---------------+
|model_family_id|classification_type|classification_value|benchmark_type_code|          data_date|data_item_code|data_item_value_numeric|data_item_value_string|fiscal_year|fiscal_quarter|        create_date|last_update_date|create_user_txt|update_user_txt|
+---------------+-------------------+--------------------+-------------------+-------------------+--------------+-----------------------+----------------------+-----------+--------------+-------------------+----------------+---------------+---------------+
|              1|            COUNTRY|                 HKG|               MEAN|2017-12-31 00:00:00|   CREDITSCORE|                     13|                   bb-|       2017|             4|2018-03-31 14:04:18|            null|           LOAD|           null|
|              1|            COUNTRY|                 HKG|            OBS_CNT|2017-12-31 00:00:00|   CREDITSCORE|                    649|                    aa|       2017|             4|2018-03-31 14:04:18|            null|           LOAD|           null|
|              1|            COUNTRY|                 HKG|         OBS_CNT_CA|2017-12-31 00:00:00|   CREDITSCORE|                    649|                  null|       2017|             4|2018-03-31 14:04:18|            null|           LOAD|           null|
|              1|            COUNTRY|                 HKG|       PERCENTILE_0|2017-12-31 00:00:00|   CREDITSCORE|                      3|                    aa|       2017|             4|2018-03-31 14:04:18|            null|           LOAD|           null|
|              1|            COUNTRY|                 HKG|      PERCENTILE_10|2017-12-31 00:00:00|   CREDITSCORE|                      8|                  bbb+|       2017|             4|2018-03-31 14:04:18|            null|           LOAD|           null|
|              1|            COUNTRY|                 HKG|     PERCENTILE_100|2017-12-31 00:00:00|   CREDITSCORE|                     23|                     d|       2017|             4|2018-03-31 14:04:18|            null|           LOAD|           null|
|              1|            COUNTRY|                 HKG|      PERCENTILE_25|2017-12-31 00:00:00|   CREDITSCORE|                     11|                   bb+|       2017|             4|2018-03-31 14:04:18|            null|           LOAD|           null|
|              1|            COUNTRY|                 HKG|      PERCENTILE_50|2017-12-31 00:00:00|   CREDITSCORE|                     14|                    b+|       2017|             4|2018-03-31 14:04:18|            null|           LOAD|           null|
|              1|            COUNTRY|                 HKG|      PERCENTILE_75|2017-12-31 00:00:00|   CREDITSCORE|                     15|                     b|       2017|             4|2018-03-31 14:04:18|            null|           LOAD|           null|
|              1|            COUNTRY|                 HKG|      PERCENTILE_90|2017-12-31 00:00:00|   CREDITSCORE|                     17|                  ccc+|       2017|             4|2018-03-31 14:04:18|            null|           LOAD|           null|
+---------------+-------------------+--------------------+-------------------+-------------------+--------------+-----------------------+----------------------+-----------+--------------+-------------------+----------------+---------------+---------------+

I tried below code

val pivot_df =  source_df.groupBy("model_family_id","classification_type","classification_value" ,"data_item_code","data_date","fiscal_year","fiscal_quarter" , "create_user_txt", "create_date")
                .pivot("benchmark_type_code" , 
                        Seq("mean","obs_cnt","obs_cnt_ca","percentile_0","percentile_10","percentile_25","percentile_50","percentile_75","percentile_90","percentile_100")
                      )
                .agg(  first(

                  when(  col("data_item_code") === "CREDITSCORE" ,  col("data_item_value_string"))
                  .otherwise(col("data_item_value_numeric"))
                )
              )  

I am getting below resutls, not sure what is wrong in my code.


+---------------+-------------------+--------------------+--------------+-------------------+-----------+--------------+---------------+-------------------+----+-------+----------+------------+-------------+-------------+-------------+-------------+-------------+--------------+
|model_family_id|classification_type|classification_value|data_item_code|          data_date|fiscal_year|fiscal_quarter|create_user_txt|        create_date|mean|obs_cnt|obs_cnt_ca|percentile_0|percentile_10|percentile_25|percentile_50|percentile_75|percentile_90|percentile_100|
+---------------+-------------------+--------------------+--------------+-------------------+-----------+--------------+---------------+-------------------+----+-------+----------+------------+-------------+-------------+-------------+-------------+-------------+--------------+
|              1|            COUNTRY|                 HKG|   CREDITSCORE|2017-12-31 00:00:00|       2017|             4|           LOAD|2018-03-31 14:04:18|null|   null|      null|        null|         null|         null|         null|         null|         null|          null|
+---------------+-------------------+--------------------+--------------+-------------------+-----------+--------------+---------------+-------------------+----+-------+----------+------------+-------------+-------------+-------------+-------------+-------------+--------------+

I tried without Seq of columns in pivot function. But still it is not pivoting as expecting , any help please???

2) In when clause if the pivoted column is i.e $"benchmark_type_code" === 'OBS_CNT' | 'OBS_CNT' then it should take $data_item_value_numeric . how to achieve this ?


Solution

  • We can have when condition in a when condition like below it works fine.

    .agg(  first(
                      when(  col("data").isin("x","a","y","z")  ,
                       when(  col("code").isin("aa","bb")  ,  col("numeric")).otherwise(col("string"))
                              )
                     .otherwise(col("numeric"))
                    )