Search code examples
helperdelete-rowcpu-wordpyspark

remove specific words into a dataframe with pyspark


I have a DataFrame

+------+--------------------+-----------------+----
|   id| titulo       |tipo      | formacion       |
+------+--------------------+-----------------+----
|32084|A             | Material | VION00001 TRADE |
|32350|B             | Curso    | CUS11222  LEADER|
|32362|C             | Curso    | ITIN9876  EVALUA|   
|32347|D             | Curso    | CUMPLI VION1234 |      
|32036|E             | Curso    | EVAN1111  INFORM|   

I need, that into formacion column remove the characters that start with VION|CUS|ITIN|VION|EVAN so Dataframe looks like

+------+--------------------+-----------------+----
|   id| titulo       |tipo      | formacion       |
+------+--------------------+-----------------+----
|32084|A             | Material |  TRADE          |
|32350|B             | Curso    |  LEADER         |
|32362|C             | Curso    |  EVALUA         |   
|32347|D             | Curso    |  CUMPLI         |      
|32036|E             | Curso    |  INFORM         |  
+------+--------------------+-----------------+----

Thank you for your help


Solution

  • Use split function to split the column by space then get the last element of array.

    • from Spark2.4+ use element_at function
    • for Spark < 2.4 use reverse(split(array))[0]

    #using element_at
    df.withColumn("formacion",element_at(split(col("formacion"),"\\s"),-1)).show() 
    
    #or using array_index
    df.withColumn("formacion",split(col("formacion"),"\\s")[1]).show()
    
    #split reverse and get first index value
    df.withColumn("formacion",reverse(split(col("formacion"),"\\s"))[0]).show()
    
    #+-----+--------------+----------+-------------+
    #|   id|titulo        |tipo      | formacion   |
    #+------+--------------------+-----------------+
    #|32084|A             | Material |  TRADE      |
    #|32350|B             | Curso    |  LEADER     |
    #|32362|C             | Curso    |  EVALUA     |   
    #|32347|D             | Curso    |  CUMPLI     |      
    #|32036|E             | Curso    |  INFORM     |  
    #+-----+--------------+----------+-------------+