I have a use case with this data:
1. "apple+case"
2. "apple+case+10+cover"
3. "apple+case+10++cover"
4. "+apple"
5. "iphone8+"
Currently, I am doing this to replace the + with space as follows:
def normalizer(value: String): String = {
if (value == null) {
null
} else {
value.replaceAll("\\+", BLANK_SPACE)
}
}
val testUDF = udf(normalizer(_: String): String)
df.withColumn("newCol", testUDF($"value"))
But this is replacing all "+". How do I replace "+" that comes between strings while also handling use cases like: "apple+case+10++cover" => "apple case 10+ cover"?
The output should be
1. "apple case"
2. "apple case 10 cover"
3. "apple case 10+ cover"
4. "apple"
5. "iphone8+"
You can use regexp_replace
to do this instead of a udf, it should be much faster. For most of the cases, you can use negative lookahead in the regexp, but for "+apple" you actually want to replace "+" with "" (and not a space). The easiest way is to simply use to regexps.
df.withColumn("newCol", regexp_replace($"value", "^\\+", ""))
.withColumn("newCol", regexp_replace($"newCol", "\\+(?!\\+|$)", " "))
This will give:
+--------------------+--------------------+
|value |newCol |
+--------------------+--------------------+
|apple+case |apple case |
|apple+case+10+cover |apple case 10 cover |
|apple+case+10++cover|apple case 10+ cover|
|+apple |apple |
|iphone8+ |iphone8+ |
+--------------------+--------------------+
To make this more modular and reusable, you can define it as a function:
def normalizer(c: String) = regexp_replace(regexp_replace(col(c), "^\\+", ""), "\\+(?!\\+|$)", " ")
df.withColumn("newCol", normalizer("value"))