Search code examples
regexpysparkreplace

Using regex to find a pattern and then replace with fillers and pushing all the digits at the end


I am using regex function. I have some unique numbers that I am trying to replace with some characters to increase the match between two variables. See one of my example below.

The example: I want APKC3475 to be replaced by AK113475. In order to do that I am using the following regex code:

df = df.withColumn('columnname',regexp_replace('columnname', '^APKC', 'AK11'))

By using this code it will replace all similar unique numbers that starts with APKC to AK11 and retains the last four characters as it is.

I want to generate a similar regex code for a unique number QT156UOE which should be converted to QT111156. Another similar example is QT265UOE to be converted to QT111265.

Basically, I want to convert the format of the unique numbers from QTXXXUOE to QT111XXX, where XXX represents the last three digits of the original number. Can someone please help with the regex code for it, preferably using PySpark.

I tried the below solution:

df = df.withColumn('columnname',regexp_replace('columnname', '^QT...UOE', 'QT111...'))

It is not working. May be it is a silly question but I just started to learn coding one month before. Thank you for your patience and support.


Solution

  • Try the following: (https://regex101.com/r/8my927/1)

    df = df.withColumn('columnname',regexp_replace('columnname', '^QT(...)UOE', 'QT111\1'))
    

    The brackets in ^QT(...)... store the contents of the first 3 characters after QT into \1. Then QT111\1 places those saved characters at the end of the ID after QT111.