I have this dataframe with a column of strings:
Column A |
---|
AB-001-1-12345-A |
AB-001-1-12346-B |
ABC012345B |
ABC012346B |
In PySpark, I want to create a new column: if a string starts with "AB-", the new column should drop that "AB-" prefix and keep the rest of the characters. Otherwise, the string should remain the same.
Expected Output:
Column A | Column B |
---|---|
AB-001-1-12345-A | 001-1-12345-A |
AB-001-1-12346-B | 001-1-12346-B |
ABC012345B | ABC012345B |
ABC012346B | ABC012346B |
Hope this works for you
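from pyspark.sql.functions import col, split, when
# If col_a starts with "AB-", keep the text after the prefix; otherwise keep col_a unchanged.
df = df.withColumn("col_b", when(col("col_a").startswith("AB-"), split(col("col_a"), "AB-").getItem(1)).otherwise(col("col_a")))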
df.show()
Output:
+----------------+-------------+
|           col_a|        col_b|
+----------------+-------------+
|AB-001-1-12345-A|001-1-12345-A|
|AB-001-1-12346-B|001-1-12346-B|
|      ABC012345B|   ABC012345B|
|      ABC012346B|   ABC012346B|
+----------------+-------------+