pyspark substring

Manipulating a string if it starts with specific characters in PySpark


I have this dataframe with a column of strings:

Column A
AB-001-1-12345-A
AB-001-1-12346-B
ABC012345B
ABC012346B

In PySpark, I want to create a new column: if a string starts with "AB-", the new column should drop the "AB-" prefix and keep the rest of the characters. Otherwise, the string should remain the same.

Expected Output:

Column A            Column B
AB-001-1-12345-A    001-1-12345-A
AB-001-1-12346-B    001-1-12346-B
ABC012345B          ABC012345B
ABC012346B          ABC012346B

Solution

  • Hope this works for you (it assumes the column is named col_a):

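    from pyspark.sql.functions import col, split, when
    # If col_a starts with "AB-", keep the text after it; otherwise keep col_a unchanged.
    stripped = split(col("col_a"), "AB-").getItem(1)
    df = df.withColumn("col_b", when(col("col_a").startswith("AB-"), stripped).otherwise(col("col_a")))
    df.show()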
    

    Output

    +----------------+-------------+
    |           col_a|        col_b|
    +----------------+-------------+
    |AB-001-1-12345-A|001-1-12345-A|
    |AB-001-1-12346-B|001-1-12346-B|
    |      ABC012345B|   ABC012345B|
    |      ABC012346B|   ABC012346B|
    +----------------+-------------+
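
One caveat with the split-based answer: if "AB-" can appear a second time later in the string, taking item 1 of the split would probably return only the text between the first two occurrences. A possible alternative is to remove the prefix with a regex anchored at the start of the string. The sketch below shows the idea in plain Python, plus a PySpark version that passes the same pattern to `regexp_replace`. The names `PREFIX_PATTERN`, `strip_prefix` and `add_col_b` are made up for illustration, and the column names `col_a` / `col_b` are taken from the answer above.

```python
import re

# Anchored pattern: matches "AB-" only at the very start of the string.
PREFIX_PATTERN = r"^AB-"


def strip_prefix(value: str) -> str:
    """Plain-Python illustration: drop a leading "AB-" if present."""
    return re.sub(PREFIX_PATTERN, "", value)


def add_col_b(df):
    """Hypothetical PySpark helper applying the same pattern to column col_a.

    regexp_replace leaves non-matching values unchanged, so no
    when/otherwise branch is needed here.
    """
    # Imported lazily so the plain-Python part runs without Spark installed.
    from pyspark.sql import functions as F

    return df.withColumn(
        "col_b", F.regexp_replace(F.col("col_a"), PREFIX_PATTERN, "")
    )


if __name__ == "__main__":
    for s in ["AB-001-1-12345-A", "ABC012345B", "AB-001-AB-2"]:
        print(s, "->", strip_prefix(s))
    # Python's str.split shows the pitfall with a repeated "AB-":
    print("AB-001-AB-2".split("AB-"))
```

With a DataFrame like the one in the question, `add_col_b(df).show()` should give the same col_b values as the output above. The difference only shows up for values that contain "AB-" more than once.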