Search code examples
pythondataframeapache-sparkpysparkregexp-replace

Pyspark - Replace portion of a string with different characters (uneven character count)


I'm trying to replace a portion of a string with a different and shorter string which has : and +. Even though the values under the Start column is time, it is not a timestamp and instead it is recognised as a string.

I've tried using regexp_replace but currently don't know how to specify the last 8 characters in the string in the 'Start' column that needs to be replaced or specify the string that I want to replace with the new one.

df = df.withColumn('replaced', regexp_replace('Start', ':00+10:00', '00Z' ))

For example: I would start with the dataframe with columns:

| Data | Qualifier | Description | Start                    |
|:----:|:---------:|:-----------:|:------------------------:|
|12    | 54        | apple       |2021-03-03T02:00:00+10:00 |
|24    | 32        | banana      |2021-03-04T22:30:00+10:00 |
|24    | 32        | orange      |2021-03-04T11:58:00+10:00 |

And would like to have the Start column replaced with a new Column named 'Replaced' or retain 'Start' column header.

Expected Output:

| Data | Qualifier | Description | Replaced                 |
|:----:|:---------:|:-----------:|:------------------------:|
|12    | 54        | apple       |2021-03-03T02:00:00Z      |
|24    | 32        | banana      |2021-03-04T22:30:00Z      |
|24    | 32        | orange      |2021-03-04T11:58:00Z      |

Solution

  • You can convert the column type to timestamp using to_timestamp() function by providing input timestamp pattern.

    If you really want to use regexp_replace then in your regex pattern you have to escape + using \

    >>> df = spark.createDataFrame(['2021-03-03T02:00:00+10:00'], StringType())
    >>> df.withColumn('replaced', regexp_replace("value", "\+\d+:\d+", "Z")).show(truncate=False)
    
    +-------------------------+--------------------+
    |value                    |replaced            |
    +-------------------------+--------------------+
    |2021-03-03T02:00:00+10:00|2021-03-03T02:00:00Z|
    +-------------------------+--------------------+