I am using regex function. I have some unique numbers that I am trying to replace with some characters to increase the match between two variables. See one of my example below.
The example: I want APKC3475
to be replaced by AK113475
. In order to do that I am using the following regex code:
df = df.withColumn('columnname',regexp_replace('columnname', '^APKC', 'AK11'))
By using this code it will replace all similar unique numbers that starts with APKC
to AK11
and retains the last four characters as it is.
I want to generate a similar regex code for a unique number QT156UOE
which should be converted to QT111156
. Another similar example is QT265UOE
to be converted to QT111265
.
Basically, I want to convert the format of the unique numbers from QTXXXUOE
to QT111XXX
, where XXX
represents the last three digits of the original number. Can someone please help with the regex code for it, preferably using PySpark.
I tried the below solution:
df = df.withColumn('columnname',regexp_replace('columnname', '^QT...UOE', 'QT111...'))
It is not working. May be it is a silly question but I just started to learn coding one month before. Thank you for your patience and support.
Try the following: (https://regex101.com/r/8my927/1)
df = df.withColumn('columnname',regexp_replace('columnname', '^QT(...)UOE', 'QT111\1'))
The brackets in ^QT(...)...
store the contents of the first 3 characters after QT into \1
. Then QT111\1
places those saved characters at the end of the ID after QT111.