python, python-3.x, tokenize

Extract all `INDENT` tokens using Python's tokenize


I am trying to use Python's tokenize library to tokenize Python code. For the sample input:

def cal_cone_curved_surf_area(slant_height,radius):\n\tpi=3.14\n\treturn pi*radius*slant_height\n\n

I am using the following code to get all the tokens (here p is the sample input string):

import io
import tokenize

text = tokenize.generate_tokens(io.StringIO(p).readline)
[tok for tok in text]

Upon running the code snippet, I get the following output:

[TokenInfo(type=1 (NAME), string='def', start=(1, 0), end=(1, 3), line='def cal_cone_curved_surf_area(slant_height,radius):\n'),
TokenInfo(type=1 (NAME), string='cal_cone_curved_surf_area', start=(1, 4), end=(1, 29), line='def cal_cone_curved_surf_area(slant_height,radius):\n'),
TokenInfo(type=53 (OP), string='(', start=(1, 29), end=(1, 30), line='def cal_cone_curved_surf_area(slant_height,radius):\n'),
 TokenInfo(type=1 (NAME), string='slant_height', start=(1, 30), end=(1, 42), line='def cal_cone_curved_surf_area(slant_height,radius):\n'),
 TokenInfo(type=53 (OP), string=',', start=(1, 42), end=(1, 43), line='def cal_cone_curved_surf_area(slant_height,radius):\n'),
 TokenInfo(type=1 (NAME), string='radius', start=(1, 43), end=(1, 49), line='def cal_cone_curved_surf_area(slant_height,radius):\n'),
 TokenInfo(type=53 (OP), string=')', start=(1, 49), end=(1, 50), line='def cal_cone_curved_surf_area(slant_height,radius):\n'),
 TokenInfo(type=53 (OP), string=':', start=(1, 50), end=(1, 51), line='def cal_cone_curved_surf_area(slant_height,radius):\n'),
 TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 51), end=(1, 52), line='def cal_cone_curved_surf_area(slant_height,radius):\n'),
 TokenInfo(type=5 (INDENT), string='\t', start=(2, 0), end=(2, 1), line='\tpi=3.14\n'),
 TokenInfo(type=1 (NAME), string='pi', start=(2, 1), end=(2, 3), line='\tpi=3.14\n'),
 TokenInfo(type=53 (OP), string='=', start=(2, 3), end=(2, 4), line='\tpi=3.14\n'),
 TokenInfo(type=2 (NUMBER), string='3.14', start=(2, 4), end=(2, 8), line='\tpi=3.14\n'),
 TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 8), end=(2, 9), line='\tpi=3.14\n'),
 TokenInfo(type=1 (NAME), string='return', start=(3, 1), end=(3, 7), line='\treturn pi*radius*slant_height\n'),
 TokenInfo(type=1 (NAME), string='pi', start=(3, 8), end=(3, 10), line='\treturn pi*radius*slant_height\n'),
 TokenInfo(type=53 (OP), string='*', start=(3, 10), end=(3, 11), line='\treturn pi*radius*slant_height\n'),
 TokenInfo(type=1 (NAME), string='radius', start=(3, 11), end=(3, 17), line='\treturn pi*radius*slant_height\n'),
 TokenInfo(type=53 (OP), string='*', start=(3, 17), end=(3, 18), line='\treturn pi*radius*slant_height\n'),
 TokenInfo(type=1 (NAME), string='slant_height', start=(3, 18), end=(3, 30), line='\treturn pi*radius*slant_height\n'),
 TokenInfo(type=4 (NEWLINE), string='\n', start=(3, 30), end=(3, 31), line='\treturn pi*radius*slant_height\n'),
 TokenInfo(type=56 (NL), string='\n', start=(4, 0), end=(4, 1), line='\n'),
  TokenInfo(type=6 (DEDENT), string='', start=(5, 0), end=(5, 0), line=''),
  TokenInfo(type=0 (ENDMARKER), string='', start=(5, 0), end=(5, 0), line='')]

As can be seen, I get only one INDENT token (the tenth entry of the output), but no second one after the second NEWLINE. How do I make sure that I get all the correct INDENT tokens in my source code?


Solution

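  • An INDENT token is generated once upon entering a block, not for each line. Upon exiting a block, generate_tokens() generates a DEDENT token. All tokens from an INDENT to the next INDENT or to the matching DEDENT have the same indentation level. The output in the question is therefore complete: lines 2 and 3 of the sample belong to the same block, so they share the single INDENT.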
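
If what you actually need is the indentation level of every line, one possible approach is to keep a running depth counter: add one for each INDENT, subtract one for each DEDENT, and record the current depth at the first token of every logical line. The following is only a minimal sketch under that assumption. The helper name indent_levels is hypothetical and not part of the tokenize module.

```python
import io
import tokenize


def indent_levels(source):
    """Return a list of (line_number, depth) pairs, one per logical line.

    indent_levels is a hypothetical helper, not a tokenize API.
    """
    depth = 0
    levels = []
    at_line_start = True
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.INDENT:
            depth += 1  # entering a block
        elif tok.type == tokenize.DEDENT:
            depth -= 1  # leaving a block
        elif tok.type == tokenize.NEWLINE:
            at_line_start = True  # the next code token starts a new logical line
        elif tok.type in (tokenize.NL, tokenize.COMMENT, tokenize.ENDMARKER):
            continue  # blank lines, comments and the end marker carry no code
        elif at_line_start:
            levels.append((tok.start[0], depth))
            at_line_start = False
    return levels


p = "def cal_cone_curved_surf_area(slant_height,radius):\n\tpi=3.14\n\treturn pi*radius*slant_height\n\n"
print(indent_levels(p))  # [(1, 0), (2, 1), (3, 1)]
```

For the sample input, lines 2 and 3 both come out at depth 1, even though only one INDENT token was emitted for them.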