Search code examples
regexscalaapache-sparkregex-lookaroundsregexp-replace

How do I replace a delimiter that appears only in between something?


I have a use case with this data:

1. "apple+case"
2. "apple+case+10+cover"
3. "apple+case+10++cover"
4. "+apple"
5. "iphone8+"

Currently, I am doing this to replace the + with space as follows:

def normalizer(value: String): String = {
    if (value == null) {
      null
    } else {
       value.replaceAll("\\+", BLANK_SPACE)        
     }
  }

  val testUDF = udf(normalizer(_: String): String)

  df.withColumn("newCol",  testUDF($"value"))

But this is replacing all "+". How do I replace "+" that comes between strings while also handling use cases like: "apple+case+10++cover" => "apple case 10+ cover"?

The output should be
1. "apple case"
2. "apple case 10 cover"
3. "apple case 10+ cover"
4. "apple"
5. "iphone8+"

Solution

  • You can use regexp_replace to do this instead of a udf, it should be much faster. For most of the cases, you can use negative lookahead in the regexp, but for "+apple" you actually want to replace "+" with "" (and not a space). The easiest way is to simply use to regexps.

    df.withColumn("newCol", regexp_replace($"value", "^\\+", ""))
      .withColumn("newCol", regexp_replace($"newCol", "\\+(?!\\+|$)", " "))
    

    This will give:

    +--------------------+--------------------+
    |value               |newCol              |
    +--------------------+--------------------+
    |apple+case          |apple case          |
    |apple+case+10+cover |apple case 10 cover |
    |apple+case+10++cover|apple case 10+ cover|
    |+apple              |apple               |
    |iphone8+            |iphone8+            |
    +--------------------+--------------------+
    

    To make this more modular and reusable, you can define it as a function:

    def normalizer(c: String) = regexp_replace(regexp_replace(col(c), "^\\+", ""), "\\+(?!\\+|$)", " ")
    
    df.withColumn("newCol", normalizer("value"))