Search code examples
dataframescalaapache-sparkapache-spark-sql

How to trim the column values using Spark Dataframe


I have a dataframe like below, and I need to trim the values in SCHDULE column using spark dataframe. I tried with UDF functions but I didn't get expected output

SCHDULE ID VALUE
100H/10AR1 KL01 30
100H/10TR2 KL01 40
100H/22TR1 KL01 20
100H/22TR2 KL01 20
105JK/12PK1 AA05 10
105JK/12PK2 AA05 20
105JH/33PK3 AA05 50
105JH/33PK4 AA05 30
110P/1 BR03 20
110P/2 BR03 10

I need output like the below dataframe, can anyone pls help me on this

SCHDULE ID VALUE
100H/10AR1 KL01 30
100H/10TR2 KL01 40
100H/22TR1 KL01 20
100H/22TR2 KL01 20
105JK/12PK1 AA05 10
105JK/12PK2 AA05 20
105JH/33PK3 AA05 50
105JH/33PK4 AA05 30
110P/1 BR03 20
110P/2 BR03 10

Solution

  • Probably you dont need udf here, use function from Spark API, in this case regexp_extract may be usefull, below you can find sample code and regexp

    import org.apache.spark.sql.functions._
    
    val inputData = Seq(
      "100H/10AR1",
      "105J/33PK4",
      "110P/1"
    )
    
    val inputDf = inputData.toDF("SCHDULE")
    inputDf.withColumn("Trimmed", regexp_extract($"SCHDULE","""^(\d+[A-Z]?\/\d+).*""",1)).show
    

    Output:

    +----------+-------+
    |   SCHDULE|Trimmed|
    +----------+-------+
    |100H/10AR1|100H/10|
    |105J/33PK4|105J/33|
    |    110P/1| 110P/1|
    +----------+-------+