Can you extract a series of letters and numbers from bad freeform data in a dataframe?
I want to create a new column in the data frame with data that contains 'NEX' and a series of numbers after it.
import pandas as pd
#Create a Dataframe
data = {
'ID':[1,2,3,4,5],
'PROGRAM': [ 'nbu 123456',
'NBU-123456',
'nex999999 b12',
'NXE999999 123',
'NBU123456 NEX999999']
}
df = pd.DataFrame(data)
I think I'm on the right lines with the below, but I somehow need to combine their functionality:-
print(df['PROGRAM'].str.contains('NEX', na=False))
# does not deal with lower case & contains letters NEX not nessary in that order
print(df['PROGRAM'].str.extract(r'([NEX]+\d+)', expand=False))
The result should only bring back NEX999999 (including converting lowercase to uppercase)
df['NEX'] = df['PROGRAM'].str.blahblahblah
You do not need to pre-filter the rows with NEX
, your regex can already do that.
One issue with your regex is [NEX]
, this allows a single character from the N/E/X set, which is not what you want. [NEX]+
would match N
or EEEN
.
You should use:
df['out'] = df['PROGRAM'].str.upper().str.extract(r'(NEX\d+)', expand=False)
And if you do not care about lower/uppercase:
import re
df['out2'] = df['PROGRAM'].str.extract(r'(NEX\d+)', expand=False, flags=re.I)
And if you just want the digits:
df['out3'] = df['PROGRAM'].str.extract(r'NEX(\d+)', expand=False, flags=re.I)
Output:
ID PROGRAM out out2 out3
0 1 nbu 123456 NaN NaN NaN
1 2 NBU-123456 NaN NaN NaN
2 3 nex999999 b12 NEX999999 nex999999 999999
3 4 NXE999999 123 NaN NaN NaN
4 5 NBU123456 NEX999999 NEX999999 NEX999999 999999