Search code examples
scalaapache-sparkapache-spark-sqltrim

Unexpected result of rtrim function in Spark


I'm trying to understand the problem I get with Spark sql function rtrim. I'm trying to remove a substring from the string and the result is not as I expected. When I try to remove _count from test_count, I get tes instead of test. But if I try to remove _count from TEST_count, I get TEST as expected.

My sample code:

import spark.implicits._
import org.apache.spark.sql.functions.rtrim

case class Test(test_count: String, test1_count: String)
val df = Seq(
    Test("test_count", "TEST_count")
).toDF

df
    .withColumn("test", rtrim($"test_count", "_count"))
    .withColumn("test1", rtrim($"test1_count", "_count"))
    .withColumn("test_", rtrim($"test_count", "count"))
    .withColumn("test1_", rtrim($"test1_count", "count"))
    .show

Its result:

+----------+-----------+----+-----+-----+------+
|test_count|test1_count|test|test1|test_|test1_|
+----------+-----------+----+-----+-----+------+
|test_count| TEST_count| tes| TEST|test_| TEST_|
+----------+-----------+----+-----+-----+------+

I was trying to add \\ before the _count but it didn't work. I found the other way to achieve the expected result but I want to understand better how trim works in Spark. Am I doing something wrong or the behaviour is indeed unexpected?


Solution

  • This is correct behaviour. rtrim removes all occurences of the specified chars on the right side of the string, NOT the specified string. So if you have test_count and you rtrim _count, it will remove every character in _, c, o, u, n, t starting from the right until it finds a character that is not one of them. This results in the removal of the last t in test, but not s because s is not in the list.

    test_count
       ^^^^^^^ all these are in [`_`, `c`, `o`, `u`, `n`, `t`]
      ^ but this isn't, so the string is trimmed until here
    

    To achieve what you wanted, you can use regexp_replace to replace the last bit with an empty string. $ in regex means the end (right side) of the string. If you want to do something similar with ltrim you can use ^ in regex.

    df
        .withColumn("test", regexp_replace($"test_count", "_count$", ""))
        .withColumn("test1", regexp_replace($"test1_count", "_count$", ""))
        .withColumn("test_", regexp_replace($"test_count", "count$", ""))
        .withColumn("test1_", regexp_replace($"test1_count", "count$", ""))
        .show
    +----------+-----------+----+-----+-----+------+
    |test_count|test1_count|test|test1|test_|test1_|
    +----------+-----------+----+-----+-----+------+
    |test_count| TEST_count|test| TEST|test_| TEST_|
    +----------+-----------+----+-----+-----+------+