
Extract a substring containing all capital letters at the end of a string


I have text looking like this:

Transfer to account IBAN NL10FRGS000000 FAKE COMPANY LTD

or

Transfer to account IBAN NL10FRGS000000 FAKE-COMPANY 22 LTD

or

Transfer #1234 received IBAN 00000 JOHN SMITH

I would like to extract the company name from the string. The name is always in capital letters and ends with either LTD or CO; sometimes it is a person's name instead, also written in capital letters at the end of the string. The company name may contain '-'.


Solution

  • You could try the following:

    import re
    
    transfers = ['Transfer to account IBAN NL10FRGS000000 FAKE COMPANY LTD',
     'Transfer to account IBAN NL10FRGS000000 FAKE-COMPANY 22 LTD',
     'Transfer #1234 received IBAN 00000 JOHN SMITH']
    
    pattern = r'[A-Z]{2}[0-9]{2}[A-Z0-9]{1,30}\s(.*$)'
    
    # [A-Z]{2}[0-9]{2}[A-Z0-9]{1,30} matches any IBAN-like string;
    # it does not check that the IBAN is actually valid.
    
    company_list = []
    
    for t in transfers:
        m = re.search(pattern, t)
        if m is not None:
            company = m.group(1)
            company_list.append(company)
            
            # note that m.group(0).split(maxsplit=1) will get you the IBAN as well
            # e.g.: iban, company = m.group(0).split(maxsplit=1)
            # print(iban, company): NL10FRGS000000 FAKE COMPANY LTD
            
    print(company_list)
    
    ['FAKE COMPANY LTD', 'FAKE-COMPANY 22 LTD']
    

    Note that the last entry doesn't return a match, since 00000 does not match the IBAN pattern (two capital letters, two digits, then 1 to 30 capital letters or digits).
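
If the name is also needed for entries like the last one, one possible variation is to anchor on the literal word IBAN rather than on the shape of the account number. This is only a sketch: it assumes every line contains the word IBAN followed by a single whitespace-free account token, which holds for the three sample strings but may not hold for real data. The helper `extract_name` and the group names `account` and `name` are made up for illustration.

```python
import re

transfers = ['Transfer to account IBAN NL10FRGS000000 FAKE COMPANY LTD',
             'Transfer to account IBAN NL10FRGS000000 FAKE-COMPANY 22 LTD',
             'Transfer #1234 received IBAN 00000 JOHN SMITH']

# Assumption: the word "IBAN" is followed by one token without spaces
# (the account identifier), and everything after that token is the name.
pattern = re.compile(r'\bIBAN\s+(?P<account>\S+)\s+(?P<name>.+)$')


def extract_name(text):
    """Return the text after the account token, or None if 'IBAN' is absent."""
    m = pattern.search(text)
    return m.group('name') if m else None


names = [extract_name(t) for t in transfers]
print(names)
# ['FAKE COMPANY LTD', 'FAKE-COMPANY 22 LTD', 'JOHN SMITH']
```

The trade-off is that this version no longer checks that the account token looks like an IBAN, so it accepts 00000 as well.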


    Update: "Since these transfers are in a pandas column is it possible to be done without for loop?" Yes, it can be done with the pandas string method str.extract. There is no need to import re in this case.

    import pandas as pd
    
    transfers = ['Transfer to account IBAN NL10FRGS000000 FAKE COMPANY LTD',
     'Transfer to account IBAN NL10FRGS000000 FAKE-COMPANY 22 LTD',
     'Transfer #1234 received IBAN 00000 JOHN SMITH']
    
    df = pd.DataFrame(transfers, columns=['Transfers'])
    
    pattern = r'[A-Z]{2}[0-9]{2}[A-Z0-9]{1,30}\s(.*$)'
    
    # expand=False returns a Series (not a one-column DataFrame) for a single group
    df['Company'] = df['Transfers'].str.extract(pattern, expand=False)
    
    print(df['Company'])
    
    0       FAKE COMPANY LTD
    1    FAKE-COMPANY 22 LTD
    2                    NaN
    Name: Company, dtype: object
    

    Or together with the IBAN:

    df = pd.DataFrame(transfers, columns=['Transfers'])
    
    # N.B. two capturing groups here in pattern
    pattern = r'([A-Z]{2}[0-9]{2}[A-Z0-9]{1,30})\s(.*$)'
    
    df[['IBAN', 'Company']] = df.Transfers.str.extract(pattern)
    
    print(df[['IBAN', 'Company']])
    
                 IBAN              Company
    0  NL10FRGS000000     FAKE COMPANY LTD
    1  NL10FRGS000000  FAKE-COMPANY 22 LTD
    2             NaN                  NaN
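
As a variation on the two-group pattern, the groups can be named so that str.extract labels the resulting columns itself. The sketch below assumes pandas is installed and reuses the sample data; the variable name `extracted` is only illustrative.

```python
import pandas as pd

transfers = ['Transfer to account IBAN NL10FRGS000000 FAKE COMPANY LTD',
             'Transfer to account IBAN NL10FRGS000000 FAKE-COMPANY 22 LTD',
             'Transfer #1234 received IBAN 00000 JOHN SMITH']

df = pd.DataFrame(transfers, columns=['Transfers'])

# Named groups (?P<name>...) become the column names of the extracted frame.
pattern = r'(?P<IBAN>[A-Z]{2}[0-9]{2}[A-Z0-9]{1,30})\s(?P<Company>.*$)'

extracted = df['Transfers'].str.extract(pattern)

# Attach the new columns to the original frame by index.
df = df.join(extracted)

print(df[['IBAN', 'Company']])
```

Rows that do not match the pattern, such as the last one, still come back as NaN in both columns.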