I have a dataframe like the one below, and I need to trim the values in the SCHDULE column using a Spark dataframe. I tried with a UDF, but I didn't get the expected output.
SCHDULE | ID | VALUE |
---|---|---|
100H/10AR1 | KL01 | 30 |
100H/10TR2 | KL01 | 40 |
100H/22TR1 | KL01 | 20 |
100H/22TR2 | KL01 | 20 |
105JK/12PK1 | AA05 | 10 |
105JK/12PK2 | AA05 | 20 |
105JH/33PK3 | AA05 | 50 |
105JH/33PK4 | AA05 | 30 |
110P/1 | BR03 | 20 |
110P/2 | BR03 | 10 |
I need output like the dataframe below; can anyone please help me with this?
SCHDULE | ID | VALUE |
---|---|---|
100H/10 | KL01 | 30 |
100H/10 | KL01 | 40 |
100H/22 | KL01 | 20 |
100H/22 | KL01 | 20 |
105JK/12 | AA05 | 10 |
105JK/12 | AA05 | 20 |
105JH/33 | AA05 | 50 |
105JH/33 | AA05 | 30 |
110P/1 | BR03 | 20 |
110P/2 | BR03 | 10 |
You probably don't need a UDF here; use a function from the Spark API instead. In this case regexp_extract may be useful. Below you can find sample code and the regexp. Note the `[A-Z]*` in the pattern: with `[A-Z]?` only a single letter would be allowed before the slash, so values such as "105JK/12PK1" would not match and regexp_extract would return an empty string.

import org.apache.spark.sql.functions._
import spark.implicits._ // needed for toDF; already in scope in spark-shell

val inputData = Seq(
  "100H/10AR1",
  "105JK/12PK1",
  "110P/1"
)
val inputDf = inputData.toDF("SCHDULE")

// capture digits, any letters, a slash, and the digits after it; drop the rest
inputDf.withColumn("Trimmed", regexp_extract($"SCHDULE", """^(\d+[A-Z]*/\d+).*""", 1)).show

Output:
+-----------+--------+
|    SCHDULE| Trimmed|
+-----------+--------+
| 100H/10AR1| 100H/10|
|105JK/12PK1|105JK/12|
|     110P/1|  110P/1|
+-----------+--------+
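To get the exact shape asked for in the question (SCHDULE trimmed in place, ID and VALUE untouched), you can overwrite the column with withColumn. This is a sketch that rebuilds a few rows of the question's data so it is self-contained; in practice you would apply it to your existing dataframe:

```scala
import org.apache.spark.sql.functions._
import spark.implicits._ // for toDF; already in scope in spark-shell

// rebuild part of the question's dataframe (SCHDULE, ID, VALUE)
val df = Seq(
  ("100H/10AR1", "KL01", 30),
  ("105JK/12PK1", "AA05", 10),
  ("110P/1", "BR03", 20)
).toDF("SCHDULE", "ID", "VALUE")

// overwrite SCHDULE with the trimmed prefix; the other columns pass through
val trimmed = df.withColumn("SCHDULE",
  regexp_extract($"SCHDULE", """^(\d+[A-Z]*/\d+).*""", 1))

trimmed.show()
```

Since withColumn reuses the existing column name, the result keeps the original three-column schema, with SCHDULE values like "100H/10", "105JK/12" and "110P/1".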