Search code examples
pythonpandasregexdataframedictionary

How to find specific word followed by number in a dataframe using Python


I have a dataframe which contains a series of patterns.

Example dataframe:

mydata: 
[ 'Ticket number INS 00909',
'Ticket number INS00909',
'Ticket number REQ 8776',
'Ticket number REQ#8777',
'Ticket number REQ #8778',
'Ticket number REQ8778',
'Number is CR 0098445554',
'No INS number',
'No REQUEST',
'DESCRIBED']

Search_Values = ['INS', 'REQ', 'CR' ]
pattern = '|'.join(Search_Values)
df['check'] = df['mydata'].str.contains(pattern)

Basically, I want to find and extract only the ticket numbers. What logic should be applied to pull the CR, REQ , INS followed by number? Sometimes the string would contain special characters like #, -, :, *, space(\s) in between Req and number.Can anyone please help?


Solution

  • If you do not have to distinguish between INS, RES and CR numbers, you can use this code:

    import re
    
    data = [ 'Ticket number INS 00909',
    'Ticket number INS00909',
    'Ticket number REQ 8776',
    'Ticket number REQ#8777',
    'Ticket number REQ #8778',
    'Ticket number REQ8778',
    'Number is CR 0098445554',
    'No INS number',
    'No REQUEST',
    'DESCRIBED']
    
    numbers = []
    
    for item in data:
        res = re.findall(r'[0-9]+$', item)
        if res != []:
            numbers.append(res[0])
    
    print(numbers)
    

    which gives a simple list with all the ticket numbers together:

    ['00909', '00909', '8776', '8777', '8778', '8778', '0098445554']
    

    Otherwise, if you need to distinguish the ticket types, use this:

    import re
    
    data = [ 'Ticket number INS 00909',
    'Ticket number INS00909',
    'Ticket number REQ 8776',
    'Ticket number REQ#8777',
    'Ticket number REQ #8778',
    'Ticket number REQ8778',
    'Number is CR 0098445554',
    'No INS number',
    'No REQUEST',
    'DESCRIBED']
    
    numbers = {'INS': [], 'REQ': [], 'CR': []}
    
    for item in data:
        res = re.findall(r'[0-9]+$', item)
        if res != []:
            if 'INS' in item:
                numbers['INS'].append(res[0])
            elif 'REQ' in item:
                numbers['REQ'].append(res[0])
            elif 'CR' in item:
                numbers['CR'].append(res[0])
    
    print(numbers)
    

    which gives a dictionary with a key for each ticket type:

    {'INS': ['00909', '00909'], 'REQ': ['8776', '8777', '8778', '8778'], 'CR': ['0098445554']}
    

    I kept the ticket numbers as str and I had not converted them into int in order to avoid this, perhaps unwanted, conversion: '00909' ---> 909.
    This solution is based on the assumption that the ticket number is always at the end of the string.