
Python comment-preserving parsing using only builtin libraries?


I wrote a library that uses just the ast and inspect modules to parse and emit internal Python constructs [it uses astor on Python < 3.9].

I just realised that I really need to preserve comments after all, preferably without resorting to RedBaron or LibCST, as I only need to emit the comments unaltered. Is there a clean and concise way of comment-preserving parsing/emitting of Python source with just the stdlib?


Solution

  • What I ended up doing was writing a simple parser, without a meta-language, in 339 source lines: https://github.com/offscale/cdd-python/blob/master/cdd/cst_utils.py

    Implementation of Concrete Syntax Tree [List!]

    1. Reads source character by character;
    2. Once the end of a statement† is detected, append a node of that statement type to a 1D list;
      • †A statement ends at end of line if line.lstrip().startswith("#"), or if the line does not end with '\\' and balanced_parens(line) is true; otherwise keep munching until that condition holds… plus some edge cases around multiline strings and the like;
    3. Once finished, there is one big (1D) list in which each element is a namedtuple with a value field.
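The end-of-statement rule above can be sketched roughly as follows. This is a minimal illustration, not the code in cst_utils.py: it works line by line rather than character by character, and the names `Stmt`, `split_statements` and the body of `balanced_parens` are assumptions made for the example. It does not try to handle every edge case (for instance, nested quotes inside f-strings).

```python
from collections import namedtuple

# Hypothetical node type; the answer only says each element is a
# namedtuple with a `value` field.
Stmt = namedtuple("Stmt", ("kind", "value"))


def balanced_parens(text):
    """True if brackets outside strings/comments balance and no string is left open."""
    depth, quote, i = 0, None, 0
    while i < len(text):
        ch = text[i]
        if quote:  # inside a string literal
            if ch == "\\":  # skip the escaped character
                i += 2
                continue
            if text.startswith(quote, i):  # closing delimiter
                i += len(quote)
                quote = None
                continue
        elif ch == "#":  # comment: ignore everything up to the newline
            newline = text.find("\n", i)
            if newline == -1:
                break
            i = newline
            continue
        elif ch in "\"'":  # opening delimiter, single or triple quoted
            quote = ch * 3 if text.startswith(ch * 3, i) else ch
            i += len(quote)
            continue
        elif ch in "([{":
            depth += 1
        elif ch in ")]}":
            depth -= 1
        i += 1
    return depth == 0 and quote is None


def split_statements(source):
    """Split source into a flat list of Stmt nodes whose values rejoin to the input."""
    nodes, buf = [], ""
    for line in source.splitlines(keepends=True):
        buf += line
        is_comment = buf.lstrip().startswith("#")
        continued = buf.rstrip("\r\n").endswith("\\") or not balanced_parens(buf)
        if not is_comment and continued:
            continue  # keep munching lines until the statement is complete
        kind = "comment" if is_comment else "statement" if buf.strip() else "blank"
        nodes.append(Stmt(kind, buf))
        buf = ""
    if buf:  # unterminated trailing text is kept verbatim
        nodes.append(Stmt("statement", buf))
    return nodes
```

Because every character of the input ends up in exactly one node's value, joining the values reproduces the original source, comments and whitespace included.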

    Integration with builtin Abstract Syntax Tree ast

    1. Limit the ast nodes that may be modified (not removed) to: the docstring of {ClassDef,AsyncFunctionDef,FunctionDef} (first body element, a Constant|Str), Assign and AnnAssign;
    2. cst_idx, cst_node = find_cst_at_ast(cst_list, _node);
    3. If it is a doc_str node, then call maybe_replace_doc_str_in_function_or_class(_node, cst_idx, cst_list);
    4. Now cst_list differs from the original only at those aforementioned nodes, and only where the change is more than whitespace. It can be turned into a string with "".join(map(attrgetter("value"), cst_list)) for passing to eval or writing straight out to a source file (e.g., overwriting in place).
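As a rough illustration of the integration steps above, the sketch below locates the list element that holds an Assign node, rewrites only that element, and joins everything back together. `find_cst_index` is a hypothetical stand-in for find_cst_at_ast (whose real implementation is not shown here) and simply matches on line numbers; the single-field `Stmt` tuple and the one-line-per-element list are also assumptions for the example. It relies on ast.unparse, so it needs Python 3.9+.

```python
import ast
from collections import namedtuple
from operator import attrgetter

Stmt = namedtuple("Stmt", ("value",))  # hypothetical CST element


def find_cst_index(cst_list, node):
    """Hypothetical lookup: index of the element whose lines include node.lineno."""
    first_line = 1
    for idx, cst_node in enumerate(cst_list):
        n_lines = len(cst_node.value.splitlines())
        if first_line <= node.lineno < first_line + n_lines:
            return idx
        first_line += n_lines
    raise LookupError("no CST element at line {}".format(node.lineno))


source = (
    "# configuration\n"
    "TIMEOUT = 5  # seconds\n"
    "\n"
    "def f():\n"
    "    return TIMEOUT  # unchanged\n"
)
# Simplification: one element per physical line (enough for this input).
cst_list = [Stmt(line) for line in source.splitlines(keepends=True)]

tree = ast.parse(source)
assign = next(n for n in ast.walk(tree) if isinstance(n, ast.Assign))
# Remember where the statement sat before modifying it.
# Note: col offsets are UTF-8 byte offsets; they equal str indices for ASCII.
start, end = assign.col_offset, assign.end_col_offset
assign.value = ast.Constant(value=10)

idx = find_cst_index(cst_list, assign)
old = cst_list[idx].value
# Replace only the statement text; the trailing comment and newline survive.
cst_list[idx] = Stmt(old[:start] + ast.unparse(assign) + old[end:])

new_source = "".join(map(attrgetter("value"), cst_list))
print(new_source)
```

Only the modified element changes; every other element, including the standalone comment and the inline comments, is emitted byte for byte as it was read.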

    Quality control

    1. 100% test coverage
    2. 100% doc coverage
    3. Support for the last 6 versions of Python (including the latest alpha)
    4. CI/CD
    5. (Apache-2.0 OR MIT) licensed

    Limitations

    1. No meta-language: because Python's own grammar isn't used, new syntax elements won't automatically be supported (match/case is supported, but any syntax introduced since isn't [yet?] supported… at least not automatically);
    2. Not built into the stdlib, so stdlib changes could break compatibility;
    3. Deleting nodes is [probably] not supported;
    4. Nodes can be incorrectly identified if there are shadowed variables or similar issues that linters should point out.