import pandas as pd
data = {'desc': ['ADRIAN PETER - ANN 80020355787C - 11 Baillon Pass.pdf',
                 'AILEEN MARCUS - ANC 800E15432922 - 5 Mandarin Way.pdf',
                 'AJITH SINGH - ANN 80020837750 - 11 Berkeley Loop.pdf',
                 'ALEX MARTIN-CURTIS - ANC 80021710355 - 26 Dovedale St.pdf',
                 # backslash doubled so Python does not read \J as an escape sequence
                 'Alice.Smith\\Jodee - Karen - ANE 80020428377 - 58 Harrisdale Dr.pdf']}
df = pd.DataFrame(data, columns=['desc'])
df
From this data frame, I want to create a new column called ID that contains only the value following ANN, ANC or ANE in each row. I am expecting the result below.
ID
80020355787C
800E15432922
80020837750
80021710355
80020428377
I tried the code below, but it did not give the desired result. I would appreciate your help with this.
df['id'] = df['desc'].str.extract(r'\-([^|]+)\-')
You can use the pattern - AN[NCE] (800[0-9A-Z]+) -, where:
AN[NCE] matches the literal AN followed by N, C or E;
800[0-9A-Z]+ matches the literal 800 followed by one or more characters between 0 and 9 or between A and Z.
>>> df['desc'].str.extract(r'- AN[NCE] (800[0-9A-Z]+) -')
0
0 80020355787C
1 800E15432922
2 80020837750
3 80021710355
4 80020428377
If not all your ids start with "800", you can just remove it from the pattern.
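To store the result in a new ID column, as the question asks, a minimal sketch could look like the following. It rebuilds the sample frame so that it runs on its own. The parentheses in the pattern mark the part that str.extract keeps, and passing expand=False makes it return a Series rather than a one-column DataFrame. The ID_any column name is hypothetical and only illustrates the variant without the 800 prefix. Rows that do not match the pattern would get NaN.

```python
import pandas as pd

# Sample rows copied from the question (backslash doubled to keep the literal valid).
df = pd.DataFrame({'desc': [
    'ADRIAN PETER - ANN 80020355787C - 11 Baillon Pass.pdf',
    'AILEEN MARCUS - ANC 800E15432922 - 5 Mandarin Way.pdf',
    'AJITH SINGH - ANN 80020837750 - 11 Berkeley Loop.pdf',
    'ALEX MARTIN-CURTIS - ANC 80021710355 - 26 Dovedale St.pdf',
    'Alice.Smith\\Jodee - Karen - ANE 80020428377 - 58 Harrisdale Dr.pdf',
]})

# With a single capture group, expand=False yields a Series that can be
# assigned directly as a new column.
df['ID'] = df['desc'].str.extract(r'- AN[NCE] (800[0-9A-Z]+) -', expand=False)

# Looser variant without the "800" prefix; ID_any is a hypothetical column name.
df['ID_any'] = df['desc'].str.extract(r'- AN[NCE] ([0-9A-Z]+) -', expand=False)

print(df[['ID', 'ID_any']])
```

On this sample both patterns pick out the same values, so the stricter one is mainly useful as a guard against unexpected rows.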