
How to automate the \left and \right pairing in LaTeX equations using Python


Problem

In LaTeX, delimiters like (…), […], and {…} can grow to match the size of the equation they enclose if you add \left before the opening delimiter and \right before the closing one, as in \left( <equation> \right).

However, these must come in pairs: every \left needs a matching \right, otherwise LaTeX reports an error. When only the opening or only the closing delimiter is wanted, the missing partner can be filled in with \left. (\left followed by a dot) or \right. (\right followed by a dot), as in \left( <equation> \right. or \left. <equation> \right)

Question: How can I automatically insert the missing pair?

Example Input:

\begin{align}
    \left( content & content \right) content \left( content \left( content \right) \\
    content \right) \left(content \left( content \\
    content \right)
\end{align}

\begin{align}
    \left( content & content \right) content \left( content \left( content \right) \nonumber \\
    content \right) \left( content \left( content \nonumber \\
    content \right)
\end{align}

Output should be:

\begin{align}
    \left( content [\right.] & [\left.] content \right) content \left( content \left( content \right) [\right.] \\
    [\left.] content \right) \left( content \left( content [\right.]  [\right.] \\
    [\left.] content \right)
\end{align}

\begin{align}
    \left( content [\right.] & [\left.] content \right) content \left( content \left( content \right) [\right.] \nonumber \\
    [\left.] content \right) \left( content \left( content [\right.] [\right.] \nonumber \\
    [\left.] content \right)
\end{align}

The ones between the square-brackets should be automatically generated (without the square brackets).

If not paired,

  • add \right. at the end of the segment, i.e. before &, \\, or \end{}, whichever applies. The number of \right. inserted must equal the number of \left without a pair. If there is a \nonumber before the end of the segment, add the \right. before the \nonumber tag.
  • add \left. at the start of the segment, i.e. after &, \\, or \begin{}, whichever applies. The number of \left. inserted must equal the number of \right without a pair.

Solution

  • It might be better to approach this problem more generally with the aid of a proper LaTeX parser. However, if you're looking to tackle this specific problem in Python as you've stated it, below is some code that will do the job.

    For the code to work out of the box, you'll only need to replace the contents of the snippet variable with your string of interest.

    The code assumes that you are trying to balance single- or multi-line equations within align blocks, and that your snippet is an uninterrupted (other than by whitespace) series of such blocks, as in the example. You should be okay with whitespace inside your equations being stripped and rearranged.

    import re
    
    snippet: str = r"""
    \begin{align}
        \left( content & content \right) content \left( content \left( content \right) \\
        content \right) \left(content \left( content \\
        content \right)
    \end{align}
    
    \begin{align}
        \left( content & content \right) content \left( content \left( content \right) \nonumber \\
        content \right) \left( content \left( content \nonumber \\
        content \right)
    \end{align}
    """
    
    # regex to capture stuff within the align blocks
    re_align = re.compile(r'\\begin\{align\}(.*?)\\end\{align\}', flags=re.DOTALL)
    
    # left bracket patterns
    re_parens_left = re.compile(r'\\left\(', flags=re.DOTALL)
    re_braces_left = re.compile(r'\\left\\\{', flags=re.DOTALL)
    re_square_left = re.compile(r'\\left\[', flags=re.DOTALL)
    # right bracket patterns
    re_parens_right = re.compile(r'\\right\)', flags=re.DOTALL)
    re_braces_right = re.compile(r'\\right\\\}', flags=re.DOTALL)
    re_square_right = re.compile(r'\\right\]', flags=re.DOTALL)
    
    re_break = re.compile(r'[\s]*\\\\[\s]*', flags=re.DOTALL)
    re_nonum = re.compile(r'\\nonumber', flags=re.DOTALL)
    
    # function that does the balancing for a column string; invoked by main loop below
    from collections import deque
    def balance(string: str, re_left: re.Pattern, re_right: re.Pattern) -> str:
        """
            for a given bracket type, identify all occurrences of the current bracket,
                and balance them using the standard stack-based algorithm; Python collections'
                'deque' data structure serves the purpose of a stack here.
        """
        re_either = re.compile(re_left.pattern + '|' + re_right.pattern, flags=re.DOTALL)
        match_list = deque(re_either.findall(string))
    
        if len(match_list) == 0:
            return string # early exit if no brackets => no balancing needed
    
        balance_stack = deque()
        for item in match_list:
            if re_left.match(item): current_bracket = 'l'
            elif re_right.match(item): current_bracket = 'r'
            else: raise ValueError(f"got problematic bracket '{item}' in 'balance'")
    
            previous_bracket = balance_stack[-1] if len(balance_stack) > 0 else None
    
            if (previous_bracket == 'l') and (current_bracket == 'r'):
                balance_stack.pop()
            else:
                balance_stack.append(current_bracket)
    
        # whatever's left on the stack is the imbalance
        remaining = ''.join(balance_stack)
        imbalance_left = remaining.count('l')   # unmatched \left  -> needs a closing \right.
        imbalance_right = remaining.count('r')  # unmatched \right -> needs an opening \left.
    
        closers = ' ' + ' '.join([r'\right.'] * imbalance_left) if imbalance_left > 0 else ''
        openers = ' '.join([r'\left.'] * imbalance_right) + ' ' if imbalance_right > 0 else ''
    
        # pull out any \nonumber so it can be re-appended after the inserted \right.
        nonum_match = re_nonum.search(string) is not None
        result = re_nonum.sub('', string).strip()
        nonum_string = r' \nonumber' if nonum_match else ''
        result = openers + result + closers + nonum_string
        return result
    
    # main loop
    result_equations = []
    for equation in re_align.findall(snippet):
        lines = re_break.split(equation.strip()) # split on double backslash
        result_lines = []
        for line in lines:
            columns = line.strip().split('&')
            result_columns = []
            for column in columns:
                # balance brackets using the stack algorithm
                result_column = column.strip()
                # for each type of bracket () or \{\} or [], return the balanced string 
                result_column = balance(result_column, re_parens_left, re_parens_right)
                result_column = balance(result_column, re_braces_left, re_braces_right)
                result_column = balance(result_column, re_square_left, re_square_right)
                
                result_columns.append(result_column)
            
            result_line = ' & '.join(result_columns)
            result_lines.append(result_line)
    
        result_equation = '\\begin{align}\n    ' + ' \\\\\n    '.join(result_lines) + '\n\\end{align}'
        result_equations.append(result_equation)
    
    result = '\n\n'.join(result_equations)
    
    print(result)
    

    How the code works

    The code relies on Python's re (regular expressions) library to identify patterns of interest. The first part of the code compiles the bracket and other patterns that we expect to work with.

    Next comes the main loop -- the input string snippet is broken down hierarchically here: first by align equation block, then by line \\ within equation, and finally by column (delimited by &) within line.
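
To make the hierarchy concrete, here is a minimal sketch of just the splitting step, with no balancing. The helper name split_cells and the sample string are made up for illustration; the two regexes follow the same idea as re_align and re_break in the code above.

```python
import re

# Hypothetical sample input; any uninterrupted series of align blocks would do.
sample = r"""
\begin{align}
    a & b \\
    c & d
\end{align}
"""

# Capture everything between \begin{align} and \end{align}.
re_align = re.compile(r'\\begin\{align\}(.*?)\\end\{align\}', flags=re.DOTALL)
# A line break is a double backslash plus any surrounding whitespace.
re_break = re.compile(r'\s*\\\\\s*')

def split_cells(text: str) -> list:
    """Return blocks -> lines -> cells as nested lists of stripped strings."""
    blocks = []
    for body in re_align.findall(text):
        lines = re_break.split(body.strip())
        blocks.append([[cell.strip() for cell in line.split('&')] for line in lines])
    return blocks

print(split_cells(sample))
# [[['a', 'b'], ['c', 'd']]]
```

Each innermost string is one column cell, which is the unit that gets balanced on its own.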

    For each column, the code balances brackets using the standard stack-based algorithm; this is done in the balance function, once for each type of bracket. An adjustment for the presence of \nonumber is made.
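
If you only want to see the stack idea in isolation, the sketch below counts unpaired \left and \right in a single cell. Note that it is a variation, not the balance function above: it ignores which delimiter follows \left or \right (LaTeX accepts mixed pairs such as \left( … \right]), and it treats \left. and \right. as ordinary partners. The name count_unpaired is made up for illustration.

```python
import re

# Match \left or \right as whole commands; the lookahead keeps
# \leftarrow, \rightarrow and similar commands from being counted.
re_token = re.compile(r'\\(left|right)(?![A-Za-z])')

def count_unpaired(cell: str) -> tuple:
    """Return (unpaired_left, unpaired_right) for one column cell."""
    stack = []
    for kind in re_token.findall(cell):
        if kind == 'right' and stack and stack[-1] == 'left':
            stack.pop()          # this \right closes the most recent open \left
        else:
            stack.append(kind)   # nothing to close (or a new \left): keep it
    return stack.count('left'), stack.count('right')

print(count_unpaired(r'\left( content'))
# (1, 0)
print(count_unpaired(r'content \right) \left(content \left( content'))
# (2, 1)
```

The first number is how many \right. to append to the cell, the second is how many \left. to prepend, which matches the second line of the expected output in the question.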

    The code then joins back the balanced columns, lines and equations to synthesize the final result.

    Limitations

    The code is a bit cumbersome but solves the problem as you've stated it, making reasonable simplifying assumptions whenever your specification has the potential to be problematic. Cases where this will fail (not exhaustive):

    \begin{align}
        \left( content & content \\ % comment: the wandering explorer turned \left(
        content \textup{sneaked in a \\left( payload}
    \end{align}
    

    Identifying comments and highly nested syntax with strange edge cases isn't in the scope of this code. I'd recommend staying vigilant if you plan to use this for anything material.
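
If comments are your main worry, one option is to strip them before balancing. This is only a rough sketch under simplifying assumptions: strip_comments is a made-up helper, it treats any % not directly preceded by a backslash as the start of a comment (so a % glued to a line break, as in \\%, is wrongly kept), and it does nothing for the \textup case above.

```python
import re

# An unescaped % starts a comment that runs to the end of the line;
# \% is a literal percent sign and is left alone.
re_comment = re.compile(r'(?<!\\)%.*')

def strip_comments(text: str) -> str:
    """Remove LaTeX comments line by line, keeping the newlines."""
    return re_comment.sub('', text)

line = r'\left( content & content \\ % comment: the wandering explorer turned \left('
print(strip_comments(line).rstrip())
# \left( content & content \\
```

You would run this on the snippet once, before the main loop, so that brackets inside comments never reach the balancing step.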