
Complex conditional Regex grouping


I have the following text:

 1. The SBI considered this sub-item at its resumed 1^st^ meeting and at its 2^nd^ meeting (see
    para. 46 above). It had before it documents FCCC/CP/2011/7 and Corr.1 and Add.1 and 2, 
    FCCC/SBI/2010/17, FCCC/SBI/2010/26 and FCCC/SBI/2010/MISC.9. A statement was made on behalf of the LDCs.

I'm interested in extracting the names of documents that follow the pattern (FCCC\/(?:SBSTA|SBI|CP|KP\/CMP|PA\/CMA)\/[0-9]{4}\/(?:INF\.|L\.|MISC\.)?[0-9]+(?:\/Add\.[0-9])?(?:\/Rev\.[0-9]+)?), that is, FCCC/document type (SBSTA, SBI, etc.)/year/number. They may or may not have addenda, corrections and revisions.
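
For reference, here is a minimal sketch of what this base pattern captures on its own, run against a shortened version of the sample sentence. The names `BASE` and `sample` are only illustrative, and the slashes are left unescaped because Python's `re` does not require escaping them.

```python
import re

# Base pattern from the question, written without a capture group so that
# findall() returns the full matches as plain strings.
BASE = re.compile(
    r"FCCC/(?:SBSTA|SBI|CP|KP/CMP|PA/CMA)/[0-9]{4}/"
    r"(?:INF\.|L\.|MISC\.)?[0-9]+(?:/Add\.[0-9])?(?:/Rev\.[0-9]+)?"
)

sample = (
    "It had before it documents FCCC/CP/2011/7 and Corr.1 and Add.1 and 2, "
    "FCCC/SBI/2010/17, FCCC/SBI/2010/26 and FCCC/SBI/2010/MISC.9."
)

base_names = BASE.findall(sample)
print(base_names)
# ['FCCC/CP/2011/7', 'FCCC/SBI/2010/17', 'FCCC/SBI/2010/26', 'FCCC/SBI/2010/MISC.9']
```

The `and Corr.1 and Add.1 and 2` tail is not picked up by this pattern, which is the part the rest of the question is about.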

There are two ways to refer to addenda, corrections or revisions:

  • Append them to the name: /Rev or /Add followed by a number
  • or list them after the name, joined with "and": Rev|Add|Corr followed by .num

From the text, I want to build the full names of the documents that are referenced in the second way. For example, map FCCC/CP/2011/7 and Corr.1 and Add.1 and 2 to ["FCCC/CP/2011/7", "FCCC/CP/2011/7/Corr.1", "FCCC/CP/2011/7/Add.1", "FCCC/CP/2011/7/Add.2"].

This is my current approach:

import re
from typing import Union


def _find_documents(par: str) -> Union[list, None]:
    """
    Finds referenced documents
    :param par:
    :return:
    """
    found_list = []
    pattern = r"(FCCC\/(?:SBSTA|SBI|CP|KP\/CMP|PA\/CMA)\/[0-9]{4}\/(?:INF\.|L\.|MISC\.)?[0-9]+(?:\/Add\.[0-9])?(?:\/Rev\.[0-9]+)?)"
    found = re.findall(pattern, par)

    # Now, we look for corrections and Revisions
    for doc in found:
        found_list.append(doc)
        doc = doc.replace(r"/", r"\/")
        pattern = doc + r"(?: and ((:?Corr\.|Add\.)?[0-9]))?(?: and ((:?Corr\.|Add\.)[0-9]))*(:? and ([0-9])+)?"
        res = re.search(pattern, par).groups()
        for pat in res:
            if pat is not None:
                found_list.append(doc + "/" + pat)

    return found_list if found_list is not None else None

st = r"""
50. The SBI considered this sub-item at its resumed 1^st^ meeting and at its 2^nd^ meeting (see para. 46 above). It had before it documents FCCC/CP/2011/7 and Corr.1 and Add.1 and 2, FCCC/SBI/2010/17, FCCC/SBI/2010/26 and FCCC/SBI/2010/MISC.9. A statement was made on behalf of the LDCs.
"""

_find_documents(st)
""" [OUT]:
['FCCC/CP/2011/7',
 'FCCC\\/CP\\/2011\\/7/Corr.1',
 'FCCC\\/CP\\/2011\\/7/Corr.',  EXTRA
 'FCCC\\/CP\\/2011\\/7/Add.1',
 'FCCC\\/CP\\/2011\\/7/Add.',  EXTRA
 'FCCC\\/CP\\/2011\\/7/ and 2',  EXTRA
 'FCCC\\/CP\\/2011\\/7/2', WRONG Should be FCCC/CP/2011/7/Add.2
 'FCCC/SBI/2010/17',
 'FCCC/SBI/2010/26',
 'FCCC/SBI/2010/MISC.9']"""
        

As you can see, I have several problems that I don't know how to solve:

  1. The groups capture extra matches ["Add.", "Corr.", " and 2"].
  2. When I append the Corrs and Adds, the / in the document name ends up escaped.
  3. I'm not sure how to map a bare submatch such as "and 2" to /Add.2 or /Corr.2, depending on the item that precedes it.

Any ideas?

Thanks,


Solution
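
Before the code itself, a hedged sketch of why the attempt in the question likely misbehaves. `(:?` appears to be a typo for `(?:`: it opens a capturing group that optionally matches a literal colon, which would explain the extra "Corr." and "Add." entries. The escaped slashes come from overwriting `doc` with its escaped form and then reusing it to build the names. The bare "and 2" can be handled by remembering the last label seen. The variable names below are illustrative only.

```python
import re

# "(:?" is a capturing group that optionally matches a literal colon,
# so the bare prefix "Corr." is captured in addition to "Corr.1".
typo = re.search(r"((:?Corr\.|Add\.)[0-9])", "and Corr.1")
print(typo.groups())   # ('Corr.1', 'Corr.')

# "(?:" is the non-capturing form that was presumably intended.
fixed = re.search(r"((?:Corr\.|Add\.)[0-9])", "and Corr.1")
print(fixed.groups())  # ('Corr.1',)

# Keep the readable name and the escaped pattern in separate variables,
# so the escaped form never leaks into the results.
doc = "FCCC/CP/2011/7"
doc_pattern = re.escape(doc)

# Carry the last seen label forward so a bare number inherits it.
# This assumes the tail starts with a labelled item such as "Corr.1".
tail = " and Corr.1 and Add.1 and 2"
names = []
label = None
for found_label, number in re.findall(r"(?:(Corr|Add|Rev)\.)?([0-9]+)", tail):
    label = found_label or label  # an empty string means "reuse previous label"
    names.append(f"{doc}/{label}.{number}")

print(names)
# ['FCCC/CP/2011/7/Corr.1', 'FCCC/CP/2011/7/Add.1', 'FCCC/CP/2011/7/Add.2']
```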

  • You can use

    import re
    text = "50. The SBI considered this sub-item at its resumed 1^st^ meeting and at its 2^nd^ meeting (see para. 46 above). It had before it documents FCCC/CP/2011/7 and Corr.1 and Add.1 and 2, FCCC/SBI/2010/17, FCCC/SBI/2010/26 and FCCC/SBI/2010/MISC.9. A statement was made on behalf of the LDCs."
    rx_main = re.compile(r'(FCCC/(?:SBSTA|SBI|CP|KP/CMP|PA/CMA)/\d{4}/(?:INF\.|L\.|MISC\.)?\d+)((?:(?:/|\s+and\s+|\s*,\s*)(?:Add|Rev|Corr)\.\d+(?:(?:\s*,\s*|\s+and\s+)\d+)*)*)')
    rx_rev = re.compile(r'(?:Add|Rev|Corr)\.\d+(?:(?:\s*,\s*|\s+and\s+)\d+)*')
    rx_split = re.compile(r'\s*,\s*|\s+and\s+')
    matches = rx_main.finditer(text)
    results = []
    for m in matches:
        results.append(m.group(1))
        chunks = [rx_split.split(x) for x in rx_rev.findall(m.group(2))]
        for ch in chunks:
            if len(ch) == 1: # it is simple, just add first item to Group 1
                results.append(f"{m.group(1)}/{ch[0]}")
            else:
                name = ch[0].split('.')[0] # Rev, Corr or Add
                for c in ch:
                    if '.' in c: # if there is a dot, append whole string to Group 1
                        results.append(f"{m.group(1)}/{c}")
                    else:
                        results.append(f"{m.group(1)}/{name}.{c}") # Append the new number to Corr/Add/Rev
    
    print(results)
    

    Output:

    ['FCCC/CP/2011/7', 'FCCC/CP/2011/7/Corr.1', 'FCCC/CP/2011/7/Add.1', 'FCCC/CP/2011/7/Add.2', 'FCCC/SBI/2010/17', 'FCCC/SBI/2010/26', 'FCCC/SBI/2010/MISC.9']
    

    See this Python demo.

    The new regex is

    (FCCC/(?:SBSTA|SBI|CP|KP/CMP|PA/CMA)/\d{4}/(?:INF\.|L\.|MISC\.)?\d+)((?:(?:/|\s+and\s+|\s*,\s*)(?:Add|Rev|Corr)\.\d+(?:(?:\s*,\s*|\s+and\s+)\d+)*)*)
    

    See the regex demo.
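
To make the pattern easier to read, the sketch below restates it in `re.VERBOSE` mode with one comment per part, and wraps the expansion step in a function. This is a restatement of the approach above under a few assumptions, not a different algorithm: the names `RX_MAIN`, `RX_ITEM` and `expand_documents` are chosen here for illustration, and the label carry-over uses a single `findall` pass instead of splitting each chunk.

```python
import re

RX_MAIN = re.compile(r"""
    (                                   # group 1: base document name
      FCCC/(?:SBSTA|SBI|CP|KP/CMP|PA/CMA)
      /\d{4}                            # year
      /(?:INF\.|L\.|MISC\.)?\d+         # optional series prefix, then number
    )
    (                                   # group 2: whole suffix tail (may be empty)
      (?:
        (?:/|\s+and\s+|\s*,\s*)         # separator: slash, "and" or comma
        (?:Add|Rev|Corr)\.\d+           # labelled item such as Corr.1
        (?:(?:\s*,\s*|\s+and\s+)\d+)*   # bare numbers that reuse the label
      )*
    )
""", re.VERBOSE)

# One item of the tail: an optional label followed by a number.
RX_ITEM = re.compile(r"(?:(Add|Rev|Corr)\.)?(\d+)")


def expand_documents(text):
    """Return base names plus one full name per referenced Add/Rev/Corr."""
    results = []
    for m in RX_MAIN.finditer(text):
        base, tail = m.group(1), m.group(2)
        results.append(base)
        label = None
        # RX_MAIN guarantees a non-empty tail starts with a labelled item,
        # so label is always set before a bare number is seen.
        for found_label, number in RX_ITEM.findall(tail):
            label = found_label or label
            results.append(f"{base}/{label}.{number}")
    return results


slash_form = expand_documents("see FCCC/SBI/2010/17/Add.1, 2 and Rev.1 for details")
print(slash_form)
# ['FCCC/SBI/2010/17', 'FCCC/SBI/2010/17/Add.1', 'FCCC/SBI/2010/17/Add.2', 'FCCC/SBI/2010/17/Rev.1']
```

Because the tail is captured as one group and then re-scanned, the slash form, the "and" form and the comma form are all handled by the same loop.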