Search code examples
pythonpandasdataframe

Extract certain word (case-insensitive) followed by numbers from Pandas df


Can you extract a series of letters and numbers from bad freeform data in a dataframe?

I want to create a new column in the data frame with data that contains 'NEX' and a series of numbers after it.

import pandas as pd

#Create a Dataframe
data = {
    'ID':[1,2,3,4,5],
    'PROGRAM': [ 'nbu 123456',
                'NBU-123456',
                'nex999999 b12',
                'NXE999999 123',
                'NBU123456 NEX999999']
}

df = pd.DataFrame(data)

I think I'm on the right lines with the below, but I somehow need to combine their functionality:-

print(df['PROGRAM'].str.contains('NEX', na=False))
# does not deal with lower case & contains letters NEX not nessary in that order
print(df['PROGRAM'].str.extract(r'([NEX]+\d+)', expand=False))

The result should only bring back NEX999999 (including converting lowercase to uppercase)

df['NEX'] = df['PROGRAM'].str.blahblahblah

Solution

  • You do not need to pre-filter the rows with NEX, your regex can already do that.

    One issue with your regex is [NEX], this allows a single character from the N/E/X set, which is not what you want. [NEX]+ would match N or EEEN.

    You should use:

    df['out'] = df['PROGRAM'].str.upper().str.extract(r'(NEX\d+)', expand=False)
    

    And if you do not care about lower/uppercase:

    import re
    
    df['out2'] = df['PROGRAM'].str.extract(r'(NEX\d+)', expand=False, flags=re.I)
    

    And if you just want the digits:

    df['out3'] = df['PROGRAM'].str.extract(r'NEX(\d+)', expand=False, flags=re.I)
    

    Output:

       ID              PROGRAM        out       out2    out3
    0   1           nbu 123456        NaN        NaN     NaN
    1   2           NBU-123456        NaN        NaN     NaN
    2   3        nex999999 b12  NEX999999  nex999999  999999
    3   4        NXE999999 123        NaN        NaN     NaN
    4   5  NBU123456 NEX999999  NEX999999  NEX999999  999999