scala apache-spark apache-spark-sql trim

Unexpected result of rtrim function in Spark

I'm trying to understand the problem I get with Spark sql function rtrim. I'm trying to remove a substring from the string and the result is not as I expected. When I try to remove _count from test_count, I get tes instead of test. But if I try to remove _count from TEST_count, I get TEST as expected.

My sample code:

import spark.implicits._
import org.apache.spark.sql.functions.rtrim

case class Test(test_count: String, test1_count: String)
val df = Seq(
    Test("test_count", "TEST_count")
).toDF

df
    .withColumn("test", rtrim($"test_count", "_count"))
    .withColumn("test1", rtrim($"test1_count", "_count"))
    .withColumn("test_", rtrim($"test_count", "count"))
    .withColumn("test1_", rtrim($"test1_count", "count"))
    .show

Its result:

+----------+-----------+----+-----+-----+------+
|test_count|test1_count|test|test1|test_|test1_|
+----------+-----------+----+-----+-----+------+
|test_count| TEST_count| tes| TEST|test_| TEST_|
+----------+-----------+----+-----+-----+------+

I was trying to add \\ before the _count but it didn't work. I found the other way to achieve the expected result but I want to understand better how trim works in Spark. Am I doing something wrong or the behaviour is indeed unexpected?

Solution

This is correct behaviour. rtrim removes all occurences of the specified chars on the right side of the string, NOT the specified string. So if you have test_count and you rtrim _count, it will remove every character in _, c, o, u, n, t starting from the right until it finds a character that is not one of them. This results in the removal of the last t in test, but not s because s is not in the list.

test_count
   ^^^^^^^ all these are in [`_`, `c`, `o`, `u`, `n`, `t`]
  ^ but this isn't, so the string is trimmed until here

To achieve what you wanted, you can use regexp_replace to replace the last bit with an empty string. $ in regex means the end (right side) of the string. If you want to do something similar with ltrim you can use ^ in regex.

df
    .withColumn("test", regexp_replace($"test_count", "_count$", ""))
    .withColumn("test1", regexp_replace($"test1_count", "_count$", ""))
    .withColumn("test_", regexp_replace($"test_count", "count$", ""))
    .withColumn("test1_", regexp_replace($"test1_count", "count$", ""))
    .show
+----------+-----------+----+-----+-----+------+
|test_count|test1_count|test|test1|test_|test1_|
+----------+-----------+----+-----+-----+------+
|test_count| TEST_count|test| TEST|test_| TEST_|
+----------+-----------+----+-----+-----+------+