I have a dataframe with column forenames. I have to create a new column first_name which has first string of characters before the first space or if hyphen occurs in first string of characters prior to first space within forenames.
|forenames | first_name |
+--------+---------------------+
|IVO-KAI ROGERS | IVO-KAI |
|DYLAN STUART JOHN| DYLAN |
|JOSH JACK | JOSH |
|MONALISA ELIEN | MONALISA |
|RACHEL- GREEN JOE|RACHEL-GREEN|
I have to create another new column middle_name which will have second string of characters from forenames after the first space followed by first_name. Expected Output will be:
|forenames | first_name | middle_name |
+--------+---------------------+-------------+
|IVO-KAI ROGERS | IVO-KAI | ROGERS |
|DYLAN STUART JOHN| DYLAN | STUART JOHN |
|JOSH JACK | JOSH | JACK |
|MONALISA ELIEN | MONALISA | ELIEN |
|RACHEL- GREEN JOE|RACHEL-GREEN| JOE |
To split the forenames column into first_name and last_name based on the first space occurrence, you can use SPLIT and SUBSTRING_INDEX functions in Spark SQL. Here's how you can achieve this.
scala> val dfa = spark.sql("SELECT forenames, SUBSTRING_INDEX(forenames, ' ', 1) AS first_name, SUBSTRING(forenames, LENGTH(SUBSTRING_INDEX(forenames, ' ', 1)) + 2) AS last_name FROM df")
dfa: org.apache.spark.sql.DataFrame = [forenames: string, first_name: string ... 1 more field]
scala> dfa.show(false)
+-----------------+----------+-----------+
|forenames |first_name|last_name |
+-----------------+----------+-----------+
|IVO-KAI ROGERS |IVO-KAI |ROGERS |
|DYLAN STUART JOHN|DYLAN |STUART JOHN|
|JOSH JACK |JOSH |JACK |
|MONALISA ELIEN |MONALISA |ELIEN |
+-----------------+----------+-----------+