regex scala apache-spark regex-lookarounds regexp-replace

How do I replace a delimiter that appears only in between something?

I have a use case with this data:

1. "apple+case"
2. "apple+case+10+cover"
3. "apple+case+10++cover"
4. "+apple"
5. "iphone8+"

Currently, I am doing this to replace the + with space as follows:

def normalizer(value: String): String = {
    if (value == null) {
      null
    } else {
       value.replaceAll("\\+", BLANK_SPACE)        
     }
  }

  val testUDF = udf(normalizer(_: String): String)

  df.withColumn("newCol",  testUDF($"value"))

But this is replacing all "+". How do I replace "+" that comes between strings while also handling use cases like: "apple+case+10++cover" => "apple case 10+ cover"?

The output should be
1. "apple case"
2. "apple case 10 cover"
3. "apple case 10+ cover"
4. "apple"
5. "iphone8+"

Solution

You can use regexp_replace to do this instead of a udf, it should be much faster. For most of the cases, you can use negative lookahead in the regexp, but for "+apple" you actually want to replace "+" with "" (and not a space). The easiest way is to simply use to regexps.

df.withColumn("newCol", regexp_replace($"value", "^\\+", ""))
  .withColumn("newCol", regexp_replace($"newCol", "\\+(?!\\+|$)", " "))

This will give:

+--------------------+--------------------+
|value               |newCol              |
+--------------------+--------------------+
|apple+case          |apple case          |
|apple+case+10+cover |apple case 10 cover |
|apple+case+10++cover|apple case 10+ cover|
|+apple              |apple               |
|iphone8+            |iphone8+            |
+--------------------+--------------------+

To make this more modular and reusable, you can define it as a function:

def normalizer(c: String) = regexp_replace(regexp_replace(col(c), "^\\+", ""), "\\+(?!\\+|$)", " ")

df.withColumn("newCol", normalizer("value"))