python, pyspark, apache-spark-sql, aws-glue

PySpark: Is there a function to split DataFrame column values on a comma?


Input

+--------------+-----------------------+-----------------------+
|ID            |Subject                |Marks                  |
+--------------+-----------------------+-----------------------+
|1             |maths,physics          |80,90                  |
|2             |Computer               |73                     |
|3             |music,sports,chemistry |76,89,85               |
+--------------+-----------------------+-----------------------+

Expected output

+--------------+-----------------------+-----------------------+
|ID            |Subject                |Marks                  |
+--------------+-----------------------+-----------------------+
|1             |maths                  |80                     |
|1             |physics                |90                     |
|2             |Computer               |73                     |
|3             |music                  |76                     |
|3             |sports                 |89                     |
|3             |chemistry              |85                     |
+--------------+-----------------------+-----------------------+

I need help getting this expected output. I have already tried the explode function, but that only works on a single column.
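
For context, here is a minimal sketch of that attempt (assuming the input is loaded as a DataFrame named df with the columns shown above): exploding only the split Subject column repeats the untouched Marks string on every row, so the marks no longer line up with their subjects.

    from pyspark.sql.functions import split, explode, col

    # Exploding one column at a time leaves Marks as the original
    # comma-separated string, duplicated on every exploded row.
    df.withColumn('Subject', explode(split(col("Subject"), ","))).show()
    # +---+-------+-----+
    # | ID|Subject|Marks|
    # +---+-------+-----+
    # |  1|  maths|80,90|
    # |  1|physics|80,90|
    # ...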


Solution

  • Another way: split the columns on "," to form arrays, zip the arrays, and leverage PySpark's inline function to achieve what you want.

    from pyspark.sql.functions import split, col

    df.withColumn('Subject', split(col("Subject"), ",")).withColumn('Marks', split(col("Marks"), ",")).selectExpr('ID', 'inline(arrays_zip(Subject,Marks))').show()
    
    +---+---------+-----+
    | ID|  Subject|Marks|
    +---+---------+-----+
    |  1|    maths|   80|
    |  1|  physics|   90|
    |  2| Computer|   73|
    |  3|    music|   76|
    |  3|   sports|   89|
    |  3|chemistry|   85|
    +---+---------+-----+
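
    For completeness, here is a self-contained sketch that reproduces the result end to end; the sample data is rebuilt from the question, and the SparkSession setup is an assumption (any existing session works).

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split, col

    spark = SparkSession.builder.getOrCreate()

    # Rebuild the sample input from the question
    df = spark.createDataFrame(
        [(1, "maths,physics", "80,90"),
         (2, "Computer", "73"),
         (3, "music,sports,chemistry", "76,89,85")],
        ["ID", "Subject", "Marks"],
    )

    # Split both columns into arrays, zip them element-wise,
    # then let inline() expand the array of structs into rows.
    (df.withColumn("Subject", split(col("Subject"), ","))
       .withColumn("Marks", split(col("Marks"), ","))
       .selectExpr("ID", "inline(arrays_zip(Subject, Marks))")
       .show())

    Note that arrays_zip requires Spark 2.4 or later.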