Search code examples
pythonpython-3.xtoken

How exactly a DEDENT token is generated in python?


I am reading a documentation about lexical analysis of python,which described the course how INDENT and DEDENT tokens are generated.I post the description here.

The indentation levels of consecutive lines are used to generate INDENT and DEDENT tokens, using a stack, as follows.

Before the first line of the file is read, a single zero is pushed on the stack; this will never be popped off again. The numbers pushed on the stack will always be strictly increasing from bottom to top. At the beginning of each logical line, the line’s indentation level is compared to the top of the stack. If it is equal, nothing happens. If it is larger, it is pushed on the stack, and one INDENT token is generated. If it is smaller, it must be one of the numbers occurring on the stack; all numbers on the stack that are larger are popped off, and for each number popped off a DEDENT token is generated. At the end of the file, a DEDENT token is generated for each number remaining on the stack that is larger than zero.

I've tried to understand the DEDENT section,but failed to,could somebody give a better explanation than the referenced?


Solution

  • As Python sometimes is easier than English, here is a rough translation of this description to Python. You can see real-world parser (written by myself) that works like this here.

    import re
    code = """
    for i in range(10):
       if i % 2 == 0:
         print(i)
       print("Next number")
    print("That's all")
    
    for i in range(10):
       if i % 2 == 0:
           print(i)
    print("That's all again)
    
    for i in range(10):
       if i % 2 == 0:
          print(i)
      print("That's all")
    """
    def get_indent(s) -> int:
        m = re.match(r' *', s)
        return len(m.group(0))
    def add_token(token):
        print(token)
    INDENT="indent"
    DEDENT="dedent"
    indent_stack = [0]
    # Before the first line of the file is read, a single zero is pushed on the stack
    for line in code.splitlines():
        print("processing line:", line)
        indent = get_indent(line)
        # At the beginning of each logical line, the line’s 
        # indentation level is compared to the top of the stack. 
        if indent > indent_stack[-1]:
            # If it is larger, it is pushed on the stack, 
            # and one INDENT token is generated.
            add_token(INDENT)
            indent_stack.append(indent)
        elif indent < indent_stack[-1]:
            while indent < indent_stack[-1]:
                #  If it is smaller, ...
                # all numbers on the stack that are larger are popped off,
                # and for each number popped off a DEDENT token is generated.
                add_token(DEDENT)
                indent_stack.pop()
            if indent != indent_stack[-1]:
                # it must be one of the numbers occurring on the stack; 
                raise IndentationError
    while indent_stack[-1]>0:
         # At the end of the file, a DEDENT token is generated for each number 
         # remaining on the stack that is larger than zero.
         add_token(DEDENT)
         indent_stack.pop()
    

    Here is the output:

    processing line: 
    processing line: for i in range(10):
    processing line:    if i % 2 == 0:
    indent
    processing line:      print(i)
    indent
    processing line:    print("Next number")
    dedent
    processing line: print("That's all")
    dedent
    processing line: 
    processing line: for i in range(10):
    processing line:    if i % 2 == 0:
    indent
    processing line:        print(i)
    indent
    processing line: print("That's all again)
    dedent
    dedent
    processing line: 
    processing line: for i in range(10):
    processing line:    if i % 2 == 0:
    indent
    processing line:       print(i)
    indent
    processing line:   print("That's all")
    dedent
    dedent
      File "<string>", line unknown
    IndentationError