Search code examples
pythonapache-sparkpysparkapache-spark-sql

Pyspark: Extract Multiple Values from a column into new columns based on Spaces and Hyphens


I have a dataframe with column forenames. I have to create a new column first_name which has first string of characters before the first space or if hyphen occurs in first string of characters prior to first space within forenames.

|forenames        | first_name |
+--------+---------------------+
|IVO-KAI ROGERS   |  IVO-KAI   |
|DYLAN STUART JOHN|  DYLAN     |
|JOSH JACK        |  JOSH      |
|MONALISA ELIEN   |  MONALISA  |
|RACHEL- GREEN JOE|RACHEL-GREEN|

I have to create another new column middle_name which will have second string of characters from forenames after the first space followed by first_name. Expected Output will be:

|forenames        | first_name | middle_name |
+--------+---------------------+-------------+
|IVO-KAI ROGERS   |  IVO-KAI   | ROGERS      |
|DYLAN STUART JOHN|  DYLAN     | STUART JOHN |
|JOSH JACK        |  JOSH      | JACK        |
|MONALISA ELIEN   |  MONALISA  | ELIEN       |
|RACHEL- GREEN JOE|RACHEL-GREEN| JOE         |

Solution

  • To split the forenames column into first_name and last_name based on the first space occurrence, you can use SPLIT and SUBSTRING_INDEX functions in Spark SQL. Here's how you can achieve this.

    scala> val dfa = spark.sql("SELECT  forenames, SUBSTRING_INDEX(forenames, ' ', 1) AS first_name, SUBSTRING(forenames, LENGTH(SUBSTRING_INDEX(forenames, ' ', 1)) + 2) AS last_name FROM df")
    dfa: org.apache.spark.sql.DataFrame = [forenames: string, first_name: string ... 1 more field]
    
    scala> dfa.show(false)
    
    +-----------------+----------+-----------+
    |forenames        |first_name|last_name  |
    +-----------------+----------+-----------+
    |IVO-KAI ROGERS   |IVO-KAI   |ROGERS     |
    |DYLAN STUART JOHN|DYLAN     |STUART JOHN|
    |JOSH JACK        |JOSH      |JACK       |
    |MONALISA ELIEN   |MONALISA  |ELIEN      |
    +-----------------+----------+-----------+