
How to match nested LaTeX macros with re in Python?


I want to match LaTeX macros correctly, including nested ones. Consider the following:

s = r'''
firstline
\lr{secondline\rl{ right-to-left
        \lr{nested left-to-right} end RTL }
        other text
}
\rl{ last \lr{end line 
} end RTL }
'''

For instance, I want to match each \lr macro together with its content. I have tried the following, but neither attempt worked correctly:

re.findall(r'(?:\\lr\{.*\})', s, re.DOTALL)
['\\lr{secondline\\rl{ right-to-left\n        \\lr{nested left-to-right} end RTL }\n        other text\n}\n\\rl{ last \\lr{end line \n} end RTL }']

Even the non-greedy version did not work in this case:

re.findall(r'(?:\\lr\{.*?\})', s, re.DOTALL)
['\\lr{secondline\\rl{ right-to-left\n        \\lr{nested left-to-right}',
 '\\lr{end line \n}']

I need a regular expression that matches this correctly. The problem is similar to matching nested parentheses, except that here the nesting uses the curly brackets of LaTeX macros.

edit:

I'd like to get the following matches:

['\\lr{secondline\\rl{ right-to-left\n        \\lr{nested left-to-right} end RTL }\n        other text\n}', 
'\\lr{nested left-to-right}',
'\\lr{end line \n}']

It would be ideal to also get the nesting level of each match, something like this:

[('\\lr{secondline\\rl{ right-to-left\n        \\lr{nested left-to-right} end RTL }\n        other text\n}', 1),
 ('\\lr{nested left-to-right}', 2),
 ('\\lr{end line \n}', 1)]

Solution

  • With the PyPI regex module (after installing it with pip install regex), you can use

    import regex
    
    s = r'''
    firstline
    \lr{secondline\rl{ right-to-left
            \lr{nested left-to-right} end RTL }
            other text
    }
    \rl{ last \lr{end line 
    } end RTL }
    '''
    
    print( [x.group() for x in regex.finditer(r'\\lr(\{(?:[^{}]++|(?1))*})', s, overlapped=True)] )
    # => ['\\lr{secondline\\rl{ right-to-left\n        \\lr{nested left-to-right} end RTL }\n        other text\n}', '\\lr{nested left-to-right}', '\\lr{end line \n}']
    

    Note the overlapped=True option passed to regex.finditer: without it, the search resumes after the end of each match, so an \lr nested inside an already matched \lr would be skipped.

    Details:

    • \\lr - the literal string \lr
    • (\{(?:[^{}]++|(?1))*}) - Group 1 (defined so that it can be recursed into):
      • \{ - a { char
      • (?:[^{}]++|(?1))* - zero or more repetitions of
        • [^{}]++ - one or more chars other than { and }, matched possessively (i.e. the engine cannot give these chars back if backtracking is triggered)
        • | - or
        • (?1) - the Group 1 pattern, recursed
      • } - a } char.
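
The question also asks for the nesting level, which the pattern above does not report. One way to get it without a third-party module is to scan the string once and keep a stack of open braces. The following is only a minimal sketch under some assumptions: braces are balanced, escaped braces such as \{ and LaTeX comments are not handled, and the level counts only enclosing macros of the same name (so an \lr inside an \rl is still level 1, as in the desired output). The helper name find_macros is hypothetical, not part of re or regex.

```python
def find_macros(text, name="lr"):
    r"""Return (macro_text, level) pairs for every \name{...} found in text.

    Results are ordered by where each macro starts. Macros that are never
    closed are silently dropped.
    """
    opener = "\\" + name + "{"
    stack = []    # one entry per open '{': (start, level) for \name{, else None
    found = []    # (start, macro_text, level)
    depth = 0     # number of \name macros currently open
    i = 0
    while i < len(text):
        if text.startswith(opener, i):
            depth += 1
            stack.append((i, depth))
            i += len(opener)
            continue
        ch = text[i]
        if ch == "{":
            # A brace that belongs to some other macro or group.
            stack.append(None)
        elif ch == "}" and stack:
            entry = stack.pop()
            if entry is not None:
                start, level = entry
                found.append((start, text[start:i + 1], level))
                depth -= 1
        i += 1
    found.sort()  # inner macros close first, so restore opening order
    return [(macro, level) for _, macro, level in found]


sample = r"\lr{a \rl{b \lr{c} d} e} \rl{f \lr{g} h}"
for macro, level in find_macros(sample):
    print(level, macro)
# 1 \lr{a \rl{b \lr{c} d} e}
# 2 \lr{c}
# 1 \lr{g}
```

Called on the string s from the question, the same function should yield the three \lr matches with levels 1, 2 and 1. The trade-off compared to the recursive pattern is that this is plain Python rather than a single regular expression, but it needs only the standard library and returns the level directly.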