Search code examples
regexpysparkregexp-replace

How to find and replace space between digits in a string column?


I need to find and replace any space between digits in a long string using regular expression.

I have tried to use regular expression such as [0-9][\s][0-9] and then regexp_replace such as .withColumn('free_text', regexp_replace('free_text', '[0-9][\s][0-9]', '')). However, the regex matches 1(space)4 where I would like to have only (space)

Here is an example:

What I have: "Hello. I am Marie. My number is 768 990"

What I would like to have: "Hello. I am Marie. My number is 768990"

Thanks,


Solution

  • Here is one way to do this, using capture groups:

    .withColumn('free_text', regexp_replace('free_text', '([0-9])\s([0-9])', '$1$2'))
    

    The idea here is to match and capture the two digits separated by a whitespace character in between them. Then, we can replace by just the two digits adjacent.