I'm trying to understand the problem I get with Spark sql function rtrim
. I'm trying to remove a substring from the string and the result is not as I expected. When I try to remove _count
from test_count
, I get tes
instead of test
. But if I try to remove _count
from TEST_count
, I get TEST
as expected.
My sample code:
import spark.implicits._
import org.apache.spark.sql.functions.rtrim
case class Test(test_count: String, test1_count: String)
val df = Seq(
Test("test_count", "TEST_count")
).toDF
df
.withColumn("test", rtrim($"test_count", "_count"))
.withColumn("test1", rtrim($"test1_count", "_count"))
.withColumn("test_", rtrim($"test_count", "count"))
.withColumn("test1_", rtrim($"test1_count", "count"))
.show
Its result:
+----------+-----------+----+-----+-----+------+
|test_count|test1_count|test|test1|test_|test1_|
+----------+-----------+----+-----+-----+------+
|test_count| TEST_count| tes| TEST|test_| TEST_|
+----------+-----------+----+-----+-----+------+
I was trying to add \\
before the _count
but it didn't work. I found the other way to achieve the expected result but I want to understand better how trim
works in Spark. Am I doing something wrong or the behaviour is indeed unexpected?
This is correct behaviour. rtrim
removes all occurences of the specified chars on the right side of the string, NOT the specified string. So if you have test_count
and you rtrim
_count
, it will remove every character in _
, c
, o
, u
, n
, t
starting from the right until it finds a character that is not one of them. This results in the removal of the last t
in test
, but not s
because s
is not in the list.
test_count
^^^^^^^ all these are in [`_`, `c`, `o`, `u`, `n`, `t`]
^ but this isn't, so the string is trimmed until here
To achieve what you wanted, you can use regexp_replace
to replace the last bit with an empty string. $
in regex means the end (right side) of the string. If you want to do something similar with ltrim
you can use ^
in regex.
df
.withColumn("test", regexp_replace($"test_count", "_count$", ""))
.withColumn("test1", regexp_replace($"test1_count", "_count$", ""))
.withColumn("test_", regexp_replace($"test_count", "count$", ""))
.withColumn("test1_", regexp_replace($"test1_count", "count$", ""))
.show
+----------+-----------+----+-----+-----+------+
|test_count|test1_count|test|test1|test_|test1_|
+----------+-----------+----+-----+-----+------+
|test_count| TEST_count|test| TEST|test_| TEST_|
+----------+-----------+----+-----+-----+------+