Python regex: split a string with multiple delimiters


I know this question has been answered before, but my use case is slightly different. I am trying to set up a regex pattern to split a few strings into a list.

Input Strings:

1. "ABC-QWERT01"
2. "ABC-QWERT01DV"
3. "ABCQWER01"

Criteria of the string, with parts numbered: ABC (1), - (2), QWERT (3), 01 (4), DV (5)

  1. The string always starts with three chars
  2. The dash is optional
  3. Then come 3-10 chars
  4. Then a two-digit number, left-padded with zeros (00-99)
  5. The suffix is 2 chars and is optional

Expected Output

1. ['ABC','-','QWERT','01']
2. ['ABC','-','QWERT','01','DV']
3. ['ABC','QWER','01']

I have tried the following patterns in a bunch of different ways, but I am missing something. My thought was to start at the beginning of the string, split after the first three chars or the dash, then split on the occurrence of two digits.

Pattern 1: r"([ -?, \d{2}])+" works, but it doesn't split off the first three chars if the dash is missing.

Pattern 2: r"([^[a-z]{3}, -?, \d{2}])+" fails to match, so nothing gets split.

Pattern 3: r"([^[a-z]{3}|-?, \d{2}])+" fails to match, so nothing gets split.

Any tips or suggestions?


Solution

  • You can use a pattern similar to:

    (?i)([A-Z]{3})(-?)([A-Z]*)([0-9]{2})([A-Z]*)
    

    Code:

    import re
    
    
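    def _parts(s):
        # Five capture groups: prefix, optional dash, letters, two digits, optional suffix.
        p = r'(?i)([A-Z]{3})(-?)([A-Z]*)([0-9]{2})([A-Z]*)'
        # With several groups, re.findall returns a list of tuples, one per match.
        return re.findall(p, s)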
    
    
    print(_parts('ABC-QWERT01DV'))
    print(_parts('ABCQWER01'))
    print(_parts('ABC-QWERT01'))
    
    

    Prints:

    [('ABC', '-', 'QWERT', '01', 'DV')]
    [('ABC', '', 'QWER', '01', '')]
    [('ABC', '-', 'QWERT', '01', '')]
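
The output above is a list of tuples, with empty strings where the dash or suffix is missing, while the question asks for flat lists without those gaps. A minimal sketch of one way to get there, assuming each input holds a single match at its start (the helper name `split_parts` is hypothetical, not part of the original answer):

```python
import re

# Same pattern as in the answer, compiled once for reuse.
PATTERN = re.compile(r'(?i)([A-Z]{3})(-?)([A-Z]*)([0-9]{2})([A-Z]*)')


def split_parts(s):
    """Return the non-empty captured groups as a flat list, or [] if no match."""
    m = PATTERN.match(s)
    if m is None:
        return []
    # Drop groups that matched the empty string (missing dash or suffix).
    return [g for g in m.groups() if g]


print(split_parts('ABC-QWERT01DV'))  # ['ABC', '-', 'QWERT', '01', 'DV']
print(split_parts('ABCQWER01'))      # ['ABC', 'QWER', '01']
print(split_parts('ABC-QWERT01'))    # ['ABC', '-', 'QWERT', '01']
```

Filtering on truthiness removes every empty group, so the position of a part in the list no longer tells you which group it came from. Keep the tuples if you need that.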
    

    Notes:

    • (?i): case-insensitive flag.
    • ([A-Z]{3}): capture group 1 with any 3 letters.
    • (-?): capture group 2 with an optional dash.
    • ([A-Z]*): capture group 3 with 0 or more letters.
    • ([0-9]{2}): capture group 4 with 2 digits.
    • ([A-Z]*): capture group 5 with 0 or more letters.
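
The pattern above is deliberately loose: groups 3 and 5 accept any number of letters. If the five criteria in the question should be enforced, a stricter variant might look like the sketch below. It assumes that "chars" means ASCII letters and that criterion 4 means exactly two digits. The names `STRICT` and `strict_parts` are hypothetical.

```python
import re

# Hypothetical stricter pattern built from the five criteria in the question.
STRICT = re.compile(
    r'([A-Z]{3})'     # 1. exactly three letters
    r'(-?)'           # 2. optional dash
    r'([A-Z]{3,10})'  # 3. three to ten letters
    r'([0-9]{2})'     # 4. two digits, e.g. 01 or 42
    r'([A-Z]{2})?',   # 5. optional two-letter suffix
    re.IGNORECASE,
)


def strict_parts(s):
    """Return the parts as a flat list, or None if s breaks any criterion."""
    m = STRICT.fullmatch(s)
    if m is None:
        return None
    # Unmatched optional groups are None or '', so both are filtered out.
    return [g for g in m.groups() if g]


print(strict_parts('ABC-QWERT01DV'))  # ['ABC', '-', 'QWERT', '01', 'DV']
print(strict_parts('ABCQWER01'))      # ['ABC', 'QWER', '01']
print(strict_parts('ABC-QW01'))       # None (middle part has only 2 letters)
```

Using `re.fullmatch` rather than `re.findall` means trailing or leading junk causes a rejection instead of being silently skipped.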