Tags: python, regex, list-comprehension, python-module

How can I find list comprehensions in Python code?


I am trying to refactor some Python modules which contain complex list comprehensions that can be single-line or multiline. An example of such a list comprehension is:

some_list = [y(x) for x in some_complex_expression if x != 2]

I tried to use the following regex pattern in PyCharm but this matches simple lists as well:

\[.+\]

Is there a way to not match simple lists and perhaps also match list comprehensions that are multiline? I am okay with solutions other than regex as well.
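
To make the problem concrete, here is a small sketch that runs the pattern above against a few sample lines using Python's re module. The sample strings and variable names are made up for illustration and are not from the original modules.

```python
import re

# The pattern from the question; DOTALL is not set, so "." stops at newlines.
pattern = re.compile(r"\[.+\]")

# Hypothetical sample lines, one per situation.
samples = {
    "comprehension": "some_list = [y(x) for x in data if x != 2]",
    "simple list": "numbers = [1, 2, 3]",
    "indexing": "first = numbers[0]",
    "multiline start": "squares = [",
}

# Record whether the pattern finds a match anywhere in each line.
matches = {name: bool(pattern.search(text)) for name, text in samples.items()}
for name, matched in matches.items():
    print(f"{name:16} matched={matched}")
```

The pattern matches the simple list and the index expression as well as the comprehension, and it misses the first line of a multiline comprehension because the closing bracket is on a later line.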


Solution

  • Regex is not designed to handle structured syntax. However carefully you write the pattern, you can almost always find corner cases that it fails to handle, as suggested by the comments above.

    A proper Python parser should be used instead, so that list comprehensions are identified according to the language grammar. Fortunately, Python's standard library includes modules that parse and navigate Python code in various ways.

    In your case, you can use the ast module to parse the code into an abstract syntax tree, walk through the AST with ast.walk, identify list comprehensions by the ListComp nodes, and output the lines of those nodes along with their line numbers.

    Since list comprehensions can be nested, you'd want to avoid outputting the inner list comprehensions when the outer ones are already printed. This can be done by keeping track of the last line number sent to the output and only printing line numbers greater than the last line number.

    For example, with the following code:

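    import ast

    with open('file.py') as file:
        lines = file.readlines()

    # ast.walk yields nodes in no guaranteed order, so collect the list
    # comprehensions first and process them by position in the file.
    comps = [node for node in ast.walk(ast.parse(''.join(lines)))
             if isinstance(node, ast.ListComp)]

    last_lineno = 0  # last line already sent to the output
    for node in sorted(comps, key=lambda n: (n.lineno, n.col_offset)):
        if node.end_lineno <= last_lineno:
            continue  # fully covered by lines already printed (e.g. nested)
        for lineno in range(max(node.lineno, last_lineno + 1), node.end_lineno + 1):
            print(lineno, lines[lineno - 1], sep='\t', end='')
        last_lineno = node.end_lineno
        print()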
    

    and the following content of file.py:

    a = [(i + 1) * 2 for i in range(3)]
    b = '[(i + 1) * 2 for i in range(3)]'
    c = [
        i * 2
        for i in range(3)
        if i
    ]
    # d = [(i + 1) * 2 for i in range(3)]
    e = [
        [(i + 1) * 2 for i in range(j)]
        for j in range(3)
    ]
    

    the code would output:

    1   a = [(i + 1) * 2 for i in range(3)]
    
    3   c = [
    4       i * 2
    5       for i in range(3)
    6       if i
    7   ]
    
    9   e = [
    10      [(i + 1) * 2 for i in range(j)]
    11      for j in range(3)
    12  ]
    

    Lines 2 and 8 do not appear because b is assigned a string and the assignment of d is commented out.

    Demo: https://replit.com/@blhsing/StimulatingCrimsonProgramminglanguage#main.py
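
As a variation on the approach above, the nested-comprehension handling can also be done structurally rather than by tracking line numbers. The following is a minimal sketch, not part of the original answer: the names ListCompFinder and find_list_comps are hypothetical, and it assumes Python 3.8 or later, where end_lineno and ast.get_source_segment are available. The visitor records a ListComp node and does not descend into it, so inner comprehensions are never reported on their own.

```python
import ast


class ListCompFinder(ast.NodeVisitor):
    """Collect outermost list comprehensions (hypothetical helper)."""

    def __init__(self):
        self.found = []

    def visit_ListComp(self, node):
        # Record the node but do not call generic_visit, so comprehensions
        # nested inside this one are not reported separately.
        self.found.append(node)


def find_list_comps(source):
    """Return (first_line, last_line, text) for each outermost list comprehension."""
    finder = ListCompFinder()
    finder.visit(ast.parse(source))
    # Sort by position so the result does not depend on traversal order.
    nodes = sorted(finder.found, key=lambda n: (n.lineno, n.col_offset))
    return [
        (n.lineno, n.end_lineno, ast.get_source_segment(source, n))
        for n in nodes
    ]


# Sample input modelled on file.py above, with a one-line nested case.
sample = (
    "a = [(i + 1) * 2 for i in range(3)]\n"
    "b = '[(i + 1) * 2 for i in range(3)]'\n"
    "c = [\n"
    "    i * 2\n"
    "    for i in range(3)\n"
    "    if i\n"
    "]\n"
    "e = [[k for k in range(j)] for j in range(3)]\n"
)

results = find_list_comps(sample)
for first, last, text in results:
    print(f"lines {first}-{last}: {text.splitlines()[0]}")
```

Returning the line span and source text instead of printing makes it easier to feed the results into a refactoring script. The string assigned to b is still ignored, because the parser sees it as a string constant rather than a comprehension.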