Tags: python, parsing, abstract-syntax-tree, interpreter

Count the number of tokens/expressions in a Python program


There exist many tools to count the source lines of code in a program. I currently use cloc. I often use this as a proxy for the complexity of a project I'm working on, and occasionally spend a few weeks trying to minimize this measure. However, it's not ideal, because it's affected by things like the length of variable names.

Is there an easy way, maybe by leveraging bits of the Python interpreter/AST parser itself, to count the number of distinct tokens in a Python program? For example:

grammar = grammar_path.read_text(encoding="UTF-8")

this line would have maybe 6 tokens, if we count the keyword argument to read_text() and the assignment operator.

I'm hoping there's an implementation of this somewhere, and I just don't know what to google to find it. It would also be helpful to know if there are any existing tools for doing this in other languages.


Solution

  • The line grammar = grammar_path.read_text(encoding="UTF-8") has ten tokens, or eleven if you count the NEWLINE token at the end of the line. You can easily see that using the generate_tokens function from the tokenize module in the standard library. (Although I use v3.11 in the examples below, the tokenize module has been available since v2.2. There have been changes to the details of the produced tokens, though.)

    Note that generate_tokens expects its argument to be a callable that returns the next line of input each time it is called. A more normal usage would be to supply the readline method of a file open for reading, as in the short sketch below; for the interactive demonstration that follows it, I just used sys.stdin.readline, which reads successive lines from stdin, and I used enumerate to number the successive tokens.
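
    A minimal sketch of that file-based usage (the path example.py is just a placeholder):

    import tokenize

    # generate_tokens calls f.readline repeatedly to fetch successive source
    # lines and yields a TokenInfo tuple for each token it finds.
    with open("example.py", encoding="utf-8") as f:
        for token in tokenize.generate_tokens(f.readline):
            print(token)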

    $ python3.11
    Python 3.11.0 (main, Oct 24 2022, 19:56:01) [GCC 7.5.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import tokenize
    >>> import sys
    >>> for i, token in enumerate(tokenize.generate_tokens(sys.stdin.readline), start=1):
    ...   print(f"""{i:3}: {token}""")
    ... 
    grammar = grammar_path.read_text(encoding="UTF-8")
      1: TokenInfo(type=1 (NAME), string='grammar', start=(1, 0), end=(1, 7), line='grammar = grammar_path.read_text(encoding="UTF-8")\n')
      2: TokenInfo(type=54 (OP), string='=', start=(1, 8), end=(1, 9), line='grammar = grammar_path.read_text(encoding="UTF-8")\n')
      3: TokenInfo(type=1 (NAME), string='grammar_path', start=(1, 10), end=(1, 22), line='grammar = grammar_path.read_text(encoding="UTF-8")\n')
      4: TokenInfo(type=54 (OP), string='.', start=(1, 22), end=(1, 23), line='grammar = grammar_path.read_text(encoding="UTF-8")\n')
      5: TokenInfo(type=1 (NAME), string='read_text', start=(1, 23), end=(1, 32), line='grammar = grammar_path.read_text(encoding="UTF-8")\n')
      6: TokenInfo(type=54 (OP), string='(', start=(1, 32), end=(1, 33), line='grammar = grammar_path.read_text(encoding="UTF-8")\n')
      7: TokenInfo(type=1 (NAME), string='encoding', start=(1, 33), end=(1, 41), line='grammar = grammar_path.read_text(encoding="UTF-8")\n')
      8: TokenInfo(type=54 (OP), string='=', start=(1, 41), end=(1, 42), line='grammar = grammar_path.read_text(encoding="UTF-8")\n')
      9: TokenInfo(type=3 (STRING), string='"UTF-8"', start=(1, 42), end=(1, 49), line='grammar = grammar_path.read_text(encoding="UTF-8")\n')
     10: TokenInfo(type=54 (OP), string=')', start=(1, 49), end=(1, 50), line='grammar = grammar_path.read_text(encoding="UTF-8")\n')
     11: TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 50), end=(1, 51), line='grammar = grammar_path.read_text(encoding="UTF-8")\n')
    

    At this point, the loop is waiting for the next line; in order to terminate the loop, I need to type an end-of-input marker (Control-D on Unix; Control-Z on Windows; in both cases, followed by Enter). The tokenizer will then return a final ENDMARKER token:

     12: TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
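
    Counting, which is what the question actually asks for, is then just a matter of consuming that token stream. A small sketch using tokenize.open (which honours any encoding declaration in the file) and collections.Counter to break the total down by token type; example.py is again only a placeholder:

    import collections
    import token
    import tokenize

    # Count how many tokens of each type the file contains.
    with tokenize.open("example.py") as f:
        counts = collections.Counter(
            tok.type for tok in tokenize.generate_tokens(f.readline)
        )

    print("total tokens:", sum(counts.values()))
    for tok_type, n in counts.most_common():
        print(f"{token.tok_name[tok_type]:12} {n}")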
    

    As explained in the docs, you can also run the tokenize module as a command-line utility to list tokens. Again, I had to terminate the input by typing the end-of-input marker, after which the last line was printed:

    $ python3.11 -m tokenize
    grammar = grammar_path.read_text(encoding="UTF-8")                                   
    1,0-1,7:            NAME           'grammar'      
    1,8-1,9:            OP             '='            
    1,10-1,22:          NAME           'grammar_path' 
    1,22-1,23:          OP             '.'            
    1,23-1,32:          NAME           'read_text'    
    1,32-1,33:          OP             '('            
    1,33-1,41:          NAME           'encoding'     
    1,41-1,42:          OP             '='            
    1,42-1,49:          STRING         '"UTF-8"'      
    1,49-1,50:          OP             ')'            
    1,50-1,51:          NEWLINE        '\n'           
    2,0-2,0:            ENDMARKER      ''
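
    Since the original motivation is a complexity measure that is not skewed by identifier length, you may also want to ignore tokens that carry no code content (comments, non-logical NL newlines, indentation, and the end markers). Which token types to skip is a judgment call; the sketch below is one possible choice, applied to every file named on the command line:

    import sys
    import tokenize

    # Token types treated here as structural rather than code content.
    SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
            tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}

    def count_tokens(path):
        with tokenize.open(path) as f:
            return sum(1 for tok in tokenize.generate_tokens(f.readline)
                       if tok.type not in SKIP)

    for path in sys.argv[1:]:
        print(f"{count_tokens(path):8}  {path}")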