python apache-spark pyspark apache-spark-sql

Pyspark: Extract Multiple Values from a column into new columns based on Spaces and Hyphens

I have a dataframe with column forenames. I have to create a new column first_name which has first string of characters before the first space or if hyphen occurs in first string of characters prior to first space within forenames.

|forenames        | first_name |
+--------+---------------------+
|IVO-KAI ROGERS   |  IVO-KAI   |
|DYLAN STUART JOHN|  DYLAN     |
|JOSH JACK        |  JOSH      |
|MONALISA ELIEN   |  MONALISA  |
|RACHEL- GREEN JOE|RACHEL-GREEN|

I have to create another new column middle_name which will have second string of characters from forenames after the first space followed by first_name. Expected Output will be:

|forenames        | first_name | middle_name |
+--------+---------------------+-------------+
|IVO-KAI ROGERS   |  IVO-KAI   | ROGERS      |
|DYLAN STUART JOHN|  DYLAN     | STUART JOHN |
|JOSH JACK        |  JOSH      | JACK        |
|MONALISA ELIEN   |  MONALISA  | ELIEN       |
|RACHEL- GREEN JOE|RACHEL-GREEN| JOE         |

Solution

To split the forenames column into first_name and last_name based on the first space occurrence, you can use SPLIT and SUBSTRING_INDEX functions in Spark SQL. Here's how you can achieve this.

scala> val dfa = spark.sql("SELECT  forenames, SUBSTRING_INDEX(forenames, ' ', 1) AS first_name, SUBSTRING(forenames, LENGTH(SUBSTRING_INDEX(forenames, ' ', 1)) + 2) AS last_name FROM df")
dfa: org.apache.spark.sql.DataFrame = [forenames: string, first_name: string ... 1 more field]

scala> dfa.show(false)

+-----------------+----------+-----------+
|forenames        |first_name|last_name  |
+-----------------+----------+-----------+
|IVO-KAI ROGERS   |IVO-KAI   |ROGERS     |
|DYLAN STUART JOHN|DYLAN     |STUART JOHN|
|JOSH JACK        |JOSH      |JACK       |
|MONALISA ELIEN   |MONALISA  |ELIEN      |
+-----------------+----------+-----------+