
Python RE, problem with optional match groups


My apologies if this has been asked before. I am parsing some law numbers from the California penal code so they can be run through an existing database to return a plain-language title of the law. For example:

PC 182(A)(1); PC 25400(A)(1); PC 25850(C)(6); PC 32310; VC 12500(A); VC 22517; VC 23103(A)

Each would be split at ';' and parsed into:

{'lawType': 'PC', 'lawNumber': '182', 'subsection': 'A', 'subsubsection': '1'} Returns: Conspiracy to commit a crime

Here's my RE search:

(?P<lawType>[A-Z]{2})[ ](?P<lawNumber>[0-9.]*[A-Z]?)\((?P<subsection>[A-Z])\)?\((?P<subsubsection>[0-9])\)?

Every law should have at least the type and number (e.g. PC 182), but sometimes they also have the subsection and subsubsection (e.g. (A)(1)). Those two subgroups need to be optional, but the search above isn't picking them up using the '?'. This code works, but I'd like to make it more compact with just one search:

import re

lineValue = 'PC 182(A)(1); PC 25400(A)(1); PC 25850(C)(6); PC 32310; VC 12500(A); VC 22517; VC 23103(A)'
#lineValue = 'PC 148(A)(1); PC 369I; PC 587C; MC 8.80.060(F)'
chargeList = map(lambda x: x.strip(), lineValue.split(';'))
for thisCharge in chargeList:
    m = re.match(r'(?P<lawType>[A-Z]{2})[ ](?P<lawNumber>[0-9.]*[A-Z]?)\((?P<subsection>[A-Z])\)\((?P<subsubsection>[0-9])\)', thisCharge)
    if m:
        detail = m.groupdict()
        print(detail)

    else:
        m = re.match(r'(?P<lawType>[A-Z]{2})[ ](?P<lawNumber>[0-9.]*[A-Z]?)\((?P<subsection>[A-Z])\)', thisCharge)
        if m:
            detail = m.groupdict()
            print(detail)

        else:
            m = re.match(r'(?P<lawType>[A-Z]{2})[ ](?P<lawNumber>[0-9.]*[A-Z]?)', thisCharge)
            if m:
                detail = m.groupdict()
                print(detail)

            else:
                print('NO MATCH: ' + str(thisCharge))

I have three different searches, which shouldn't be necessary if the '?' optional group marker were working as expected. Can anyone offer a thought?


Solution

  • The problem is in how you are applying the ? to make each of the subsections optional. A ? applies only to the term immediately preceding it. In your case, that term is just the closing parenthesis of each subsection, so the opening parenthesis and the letter or digit inside it are still required unconditionally. To fix this, wrap each complete subsection term in an extra group (a non-capturing (?:...) group here, so the numbering of your capturing groups is unchanged) and apply the ? to that group. This code:

    import re
    
    data = "PC 182(A)(1); PC 25400(A)(1); PC 25850(C)(6); PC 32310; VC 12500(A); VC 22517; VC 23103(A)"
    
    exp = re.compile(r"(?P<lawType>[A-Z]{2})[ ](?P<lawNumber>[0-9.]*[A-Z]?)(?:\((?P<subsection>[A-Z])\))?(?:\((?P<subsubsection>[0-9])\))?")
    
    def main():
        r = exp.findall(data)
        print(r)
    
    main()
    

    produces:

    [('PC', '182', 'A', '1'), ('PC', '25400', 'A', '1'), ('PC', '25850', 'C', '6'), ('PC', '32310', '', ''), ('VC', '12500', 'A', ''), ('VC', '22517', '', ''), ('VC', '23103', 'A', '')]
    
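To see the difference in quantifier scope directly, a small check like the following may help. It is only a sketch: the names `original`, `fixed`, and `samples` are made up for illustration, and the two patterns are the ones from the question and from the code above.

```python
import re

# Pattern from the question: each '?' applies only to the '\)' right before
# it, so '\(' and the letter or digit inside remain mandatory.
original = re.compile(
    r"(?P<lawType>[A-Z]{2})[ ](?P<lawNumber>[0-9.]*[A-Z]?)"
    r"\((?P<subsection>[A-Z])\)?\((?P<subsubsection>[0-9])\)?"
)

# Corrected pattern: each whole '(X)' term sits inside a non-capturing
# group, and '?' applies to that group.
fixed = re.compile(
    r"(?P<lawType>[A-Z]{2})[ ](?P<lawNumber>[0-9.]*[A-Z]?)"
    r"(?:\((?P<subsection>[A-Z])\))?(?:\((?P<subsubsection>[0-9])\))?"
)

samples = ["PC 182(A)(1)", "VC 12500(A)", "PC 32310", "PC 182(A(1"]
for s in samples:
    # Print whether each pattern matches at the start of the sample.
    print(s, bool(original.match(s)), bool(fixed.match(s)))
```

The last sample has no closing parentheses at all, yet the original pattern accepts it, because the closing parenthesis is the only part that '?' made optional.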

    Here's an example of how to use your expression to pick out the information for each law individually, making use of your group labels:

    def main():
        # finditer yields one match object per law, scanning left to right
        for m in exp.finditer(data):
            print('Type:', m.group('lawType'))
            print('Number:', m.group('lawNumber'))
            # Optional groups that did not take part in the match are None
            if m.group('subsection'):
                print('Subsection:', m.group('subsection'))
            if m.group('subsubsection'):
                print('Subsubsection:', m.group('subsubsection'))
            print()
    
    main()
    

    Result

    Type: PC
    Number: 182
    Subsection: A
    Subsubsection: 1
    
    Type: PC
    Number: 25400
    Subsection: A
    Subsubsection: 1
    
    Type: PC
    Number: 25850
    Subsection: C
    Subsubsection: 6
    
    Type: PC
    Number: 32310
    
    Type: VC
    Number: 12500
    Subsection: A
    
    Type: VC
    Number: 22517
    
    Type: VC
    Number: 23103
    Subsection: A
    

    Noticing how you were pre-splitting your data before applying your regex, I thought you might want to see how I would process each term using only regex matching.
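If you would rather keep the pre-splitting approach from the question, the three nested searches can collapse into a single one with the corrected pattern. This is a minimal sketch, not a definitive implementation: `LAW_RE` and `parse_charges` are hypothetical names, and using `fullmatch` instead of `match` is an assumption on my part, chosen so that trailing junk after a charge is rejected rather than silently ignored.

```python
import re

# Corrected pattern: both '(X)' terms are optional as whole units.
LAW_RE = re.compile(
    r"(?P<lawType>[A-Z]{2})[ ](?P<lawNumber>[0-9.]*[A-Z]?)"
    r"(?:\((?P<subsection>[A-Z])\))?(?:\((?P<subsubsection>[0-9])\))?"
)


def parse_charges(line_value):
    """Split on ';' and parse each charge with one pattern.

    Returns a list with one dict per charge, or None where a charge
    does not match the pattern.
    """
    parsed = []
    for charge in (part.strip() for part in line_value.split(";")):
        # fullmatch (an assumption, see above) requires the whole charge
        # to fit the pattern, unlike match, which only anchors the start.
        m = LAW_RE.fullmatch(charge)
        # Optional groups that did not participate come back as None.
        parsed.append(m.groupdict() if m else None)
    return parsed


for detail in parse_charges("PC 148(A)(1); PC 369I; PC 587C; MC 8.80.060(F)"):
    print(detail)
```

Note that `groupdict()` reports a missing subsection as `None`, whereas `findall` reports it as an empty string, so downstream code should test for whichever form it receives.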