python, pyspark, apache-spark-sql, aws-glue

PySpark: Is there a function to split DataFrame column values on a comma?


Input

+--------------+-----------------------+-----------------------+
|ID            |Subject                |Marks                  |
+--------------+-----------------------+-----------------------+
|1             |maths,physics          |80,90                  |
|2             |Computer               |73                     |
|3             |music,sports,chemistry |76,89,85               |
+--------------+-----------------------+-----------------------+

Expected output

+--------------+-----------------------+-----------------------+
|ID            |Subject                |Marks                  |
+--------------+-----------------------+-----------------------+
|1             |maths                  |80                     |
|1             |physics                |90                     |
|2             |Computer               |73                     |
|3             |music                  |76                     |
|3             |sports                 |89                     |
|3             |chemistry              |85                     |
+--------------+-----------------------+-----------------------+

I need help getting this expected output. I have already tried the explode function, but that only works on a single column.
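
For context, here is a minimal sketch of that attempt (assuming the input is loaded as a DataFrame named df with the columns shown above): exploding only the split Subject column repeats the untouched Marks string on every row, so the marks no longer line up with their subjects.

    from pyspark.sql.functions import split, explode, col

    # Exploding one column at a time leaves Marks as the original
    # comma-separated string, duplicated on every exploded row.
    df.withColumn('Subject', explode(split(col("Subject"), ","))).show()
    # +---+-------+-----+
    # | ID|Subject|Marks|
    # +---+-------+-----+
    # |  1|  maths|80,90|
    # |  1|physics|80,90|
    # ...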


Solution

  • Another way: split the columns on "," to form arrays, zip the arrays, and leverage PySpark's inline function to achieve what you want.

    from pyspark.sql.functions import split, col

    df.withColumn('Subject', split(col("Subject"), ",")).withColumn('Marks', split(col("Marks"), ",")).selectExpr('ID', 'inline(arrays_zip(Subject,Marks))').show()
    
    +---+---------+-----+
    | ID|  Subject|Marks|
    +---+---------+-----+
    |  1|    maths|   80|
    |  1|  physics|   90|
    |  2| Computer|   73|
    |  3|    music|   76|
    |  3|   sports|   89|
    |  3|chemistry|   85|
    +---+---------+-----+
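
    For completeness, here is a self-contained sketch that reproduces the result end to end; the sample data is rebuilt from the question, and the SparkSession setup is an assumption (any existing session works).

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split, col

    spark = SparkSession.builder.getOrCreate()

    # Rebuild the sample input from the question
    df = spark.createDataFrame(
        [(1, "maths,physics", "80,90"),
         (2, "Computer", "73"),
         (3, "music,sports,chemistry", "76,89,85")],
        ["ID", "Subject", "Marks"],
    )

    # Split both columns into arrays, zip them element-wise,
    # then let inline() expand the array of structs into rows.
    (df.withColumn("Subject", split(col("Subject"), ","))
       .withColumn("Marks", split(col("Marks"), ","))
       .selectExpr("ID", "inline(arrays_zip(Subject, Marks))")
       .show())

    Note that arrays_zip requires Spark 2.4 or later.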