Search code examples
pythonregexregex-replace

Better way to use regular expression


I have a string of account information with multiple accounts in the string (the example shows one line, where I actually have a text file with multiple lines of account data, so there is another loop going through each line in the text file in my code). I need to pull out each account into its own line. The code below works, but I assume there is a more efficient or better way to do it. I am just starting to learn Regex.

import re
import pandas as pd

allAccounts = []
example = '02-0015800-00  NAME1  100 SOME ST  Active  3/8/2021  139.23  139.81  0.42  02-0023901-01  NAME2  101 SOME ST  Active  3/8/2021  512.33  482.96  -5.73'  

rex = '[0-9]{1,2}-[0-9]{1,7}-[0-9]{1,2}'
accounts = re.findall(rex, example)
for account in accounts:       
    example= example.replace(account, f'||{account}')
       
example = [account.replace('  ', '|').split('|') for account in example.split('||')][1:]
allAccounts += example
df = pd.DataFrame(allAccounts)
df

from the regex portion of the code, I want it to return:

['02-0015800-00  NAME1  100 SOME ST  Active  3/8/2021  139.23  139.81  0.42', ' 02-0023901-01  NAME2  101 SOME ST  Active  3/8/2021  512.33  482.96  -5.73'] 
# or 
'||02-0015800-00  NAME1  100 SOME ST  Active  3/8/2021  139.23  139.81  0.42  ||02-0023901-01  NAME2  101 SOME ST  Active  3/8/2021  512.33  482.96  -5.73'

The code returns the following df which is what ultimately I want:

                0       1             2     3          4         5      6      7       8
0   02-0015800-00   NAME1   100 SOME ST Active  3/8/2021    139.23  139.81  0.42    
1   02-0023901-01   NAME2   101 SOME ST Active  3/8/2021    512.33  482.96  -5.73   None

But I feel like there is a better way to use the Regex than what I am doing. Reading the docs it seems like re.sub should do it, but it is only replacing the first account number it comes upon, and it only want to replace the account number not add the '||' separator to the beginning.


update:

Using the following it gets close to what I want but not sure why the first item in the list is ''.

example = '02-0015800-00  NAME1  100 SOME ST  Active  3/8/2021  139.23  139.81  0.42  02-0023901-01  NAME2  101 SOME ST  Active  3/8/2021  512.33  482.96  -5.73'  
rex = re.compile('(?=[0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9])')
re.split(rex, example)

outputs:

['',
 '02-0015800-00  NAME1  100 SOME ST  Active  3/8/2021  139.23  139.81  0.42  ',
 '02-0023901-01  NAME2  101 SOME ST  Active  3/8/2021  512.33  482.96  -5.73']

Solution

  • Instead of using split, you can match the values:

    \b\d\d-\d{7}-\d\d\b.*?(?=\s*\b\d\d-\d{7}-\d\d\b.*?|$)
    

    Explanation

    • \b\d\d-\d{7}-\d\d\b Match the pattern with 2 digits - 7 digits - 2 digits using a quantifier
    • .*? Match any character as least as possible
    • (?=\s*\b\d\d-\d{7}-\d\d\b.*?|$) Positive lookahead to assert either the digits pattern to the right or the end of the string to also match the last occurrence

    Regex demo

    Example

    import re
    
    pattern = r"\b\d\d-\d{7}-\d\d\b.*?(?=\s*\b\d\d-\d{7}-\d\d\b.*?|$)"
    
    s = "02-0015800-00  NAME1  100 SOME ST  Active  3/8/2021  139.23  139.81  0.42  02-0023901-01  NAME2  101 SOME ST  Active  3/8/2021  512.33  482.96  -5.73"
    
    print(re.findall(pattern, s))
    

    Output

    ['02-0015800-00  NAME1  100 SOME ST  Active  3/8/2021  139.23  139.81  0.42', '02-0023901-01  NAME2  101 SOME ST  Active  3/8/2021  512.33  482.96  -5.73']
    

    If you must use split:

    import re

    pattern = r"\b(?=\d\d-\d{7}-\d\d\b)"
    
    s = "02-0015800-00  NAME1  100 SOME ST  Active  3/8/2021  139.23  139.81  0.42  02-0023901-01  NAME2  101 SOME ST  Active  3/8/2021  512.33  482.96  -5.73"
    
    result = [m.strip() for m in re.split(pattern, s) if m]
    print(result)
    

    See a Python demo