Tags: pyspark, hive, pivot

Hive/pyspark: pivot non-numeric data for a huge dataset


I'm looking for a way to pivot an input dataset with the structure below in Hive or PySpark. The input contains more than half a billion records, and for each emp_id there are up to 8 rows and 5 columns possible, so I will end up with 40 columns. I referred to this link, but there the pivoted output column already exists in the dataset, whereas in mine it doesn't. I also tried this link, but the SQL becomes very large (not that it matters). Is there a better way to do this where the resulting pivoted columns are concatenated with the rank?

input

emp_id, dept_id, dept_name, rank
1001,   101,     sales,     1
1001,   102,     marketing, 2
1002,   101,     sales,     1
1002,   102,     marketing, 2

expected output

emp_id, dept_id_1, dept_name_1, dept_id_2, dept_name_2
1001,   101,       sales,       102,       marketing
1002,   101,       sales,       102,       marketing
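
For reference, a minimal sketch that builds this sample input as a PySpark DataFrame (the column types and the variable name df are assumptions chosen to match the answer below):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Sample rows copied from the input table above
    df = spark.createDataFrame(
        [
            (1001, 101, 'sales',     1),
            (1001, 102, 'marketing', 2),
            (1002, 101, 'sales',     1),
            (1002, 102, 'marketing', 2),
        ],
        ['emp_id', 'dept_id', 'dept_name', 'rank']
    )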

Solution

  • You can run multiple aggregations after pivoting, and aliasing them gives you control over the resulting column names, like so (a note on matching the exact expected names follows the example):

    import pyspark.sql.functions as F
    
    # Pivot on rank, then take the first dept_id/dept_name per emp_id;
    # the aliases become the suffix of each pivoted column (e.g. 1_dept_id).
    (df
        .groupBy('emp_id')
        .pivot('rank')
        .agg(
            F.first('dept_id').alias('dept_id'),
            F.first('dept_name').alias('dept_name')
        )
        .show()
    )
    
    # Output
    # +------+---------+-----------+---------+-----------+
    # |emp_id|1_dept_id|1_dept_name|2_dept_id|2_dept_name|
    # +------+---------+-----------+---------+-----------+
    # |  1002|      101|      sales|      102|  marketing|
    # |  1001|      101|      sales|      102|  marketing|
    # +------+---------+-----------+---------+-----------+
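
    Note that the pivot names the columns with the rank as a prefix ("1_dept_id") rather than the "dept_id_1" suffix shown in the expected output. A minimal follow-up sketch, assuming the same df and that rank only ever takes the values 1 through 8 as described in the question:

    import pyspark.sql.functions as F

    # Listing the pivot values up front (ranks 1..8 per the question) lets
    # Spark skip the extra job it would otherwise run to collect the
    # distinct pivot values -- worth it on half a billion rows.
    pivoted = (df
        .groupBy('emp_id')
        .pivot('rank', list(range(1, 9)))
        .agg(
            F.first('dept_id').alias('dept_id'),
            F.first('dept_name').alias('dept_name')
        )
    )

    # The pivot produces "1_dept_id", "2_dept_name", ...; flip the rank
    # prefix into a suffix to get the "dept_id_1" style from the question.
    renamed = pivoted.select(
        'emp_id',
        *[F.col(c).alias('_'.join(reversed(c.split('_', 1))))
          for c in pivoted.columns if c != 'emp_id']
    )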