
How to split Python source code into tokens?


I have source code like:

model_stateless.fit(x_train,
                    y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_test, y_test),
                    shuffle=False)

print('Predicting')
predicted_stateless = model_stateless.predict(x_test, batch_size=batch_size)

or

def create_model(stateful):
    model = Sequential()
    model.add(LSTM(20,
                   input_shape=(lahead, 1),
                   batch_size=batch_size,
                   stateful=stateful))
    model.add(Dense(1))
    model.compile(loss='mse', optimizer='adam')
    return model

I want to tokenize this. A plain filecontents.split(' ') does not work, because commas, parentheses, and other punctuation stay attached to the neighboring names.

Ideally, I want the tokens to come out as: 'model_stateless.fit', '(', 'x_train', ',', 'y_train', ',', 'batch_size=batch_size', ',', etc.


Solution

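  • Use the tokenize module from the standard library. The -e (--exact) flag reports operators by their exact type (LPAR, COMMA, EQUAL) rather than the generic OP: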

    ❯ python3 -m tokenize -e file.py
    0,0-0,0:            ENCODING       'utf-8'        
    1,0-1,3:            NAME           'def'          
    1,4-1,16:           NAME           'create_model' 
    1,16-1,17:          LPAR           '('            
    1,17-1,25:          NAME           'stateful'     
    1,25-1,26:          RPAR           ')'            
    1,26-1,27:          COLON          ':'            
    1,27-1,28:          NEWLINE        '\n'           
    2,0-2,4:            INDENT         '    '         
    2,4-2,9:            NAME           'model'        
    2,10-2,11:          EQUAL          '='            
    2,12-2,22:          NAME           'Sequential'   
    2,22-2,23:          LPAR           '('            
    2,23-2,24:          RPAR           ')'            
    2,24-2,25:          NEWLINE        '\n'           
    3,4-3,9:            NAME           'model'        
    3,9-3,10:           DOT            '.'            
    3,10-3,13:          NAME           'add'          
    3,13-3,14:          LPAR           '('            
    3,14-3,18:          NAME           'LSTM'         
    3,18-3,19:          LPAR           '('            
    3,19-3,21:          NUMBER         '20'           
    3,21-3,22:          COMMA          ','            
    3,22-3,23:          NL             '\n'           
    4,19-4,30:          NAME           'input_shape'  
    4,30-4,31:          EQUAL          '='            
    4,31-4,32:          LPAR           '('            
    4,32-4,38:          NAME           'lahead'       
    4,38-4,39:          COMMA          ','            
    4,40-4,41:          NUMBER         '1'            
    4,41-4,42:          RPAR           ')'            
    4,42-4,43:          COMMA          ','            
    4,43-4,44:          NL             '\n'           
    5,19-5,29:          NAME           'batch_size'   
    5,29-5,30:          EQUAL          '='            
    5,30-5,40:          NAME           'batch_size'   
    5,40-5,41:          COMMA          ','            
    5,41-5,42:          NL             '\n'           
    6,19-6,27:          NAME           'stateful'     
    6,27-6,28:          EQUAL          '='            
    6,28-6,36:          NAME           'stateful'     
    6,36-6,37:          RPAR           ')'            
    6,37-6,38:          RPAR           ')'            
    6,38-6,39:          NEWLINE        '\n'           
    7,4-7,9:            NAME           'model'        
    7,9-7,10:           DOT            '.'            
    7,10-7,13:          NAME           'add'          
    7,13-7,14:          LPAR           '('            
    7,14-7,19:          NAME           'Dense'        
    7,19-7,20:          LPAR           '('            
    7,20-7,21:          NUMBER         '1'            
    7,21-7,22:          RPAR           ')'            
    7,22-7,23:          RPAR           ')'            
    7,23-7,24:          NEWLINE        '\n'           
    8,4-8,9:            NAME           'model'        
    8,9-8,10:           DOT            '.'            
    8,10-8,17:          NAME           'compile'      
    8,17-8,18:          LPAR           '('            
    8,18-8,22:          NAME           'loss'         
    8,22-8,23:          EQUAL          '='            
    8,23-8,28:          STRING         "'mse'"        
    8,28-8,29:          COMMA          ','            
    8,30-8,39:          NAME           'optimizer'    
    8,39-8,40:          EQUAL          '='            
    8,40-8,46:          STRING         "'adam'"       
    8,46-8,47:          RPAR           ')'            
    8,47-8,48:          NEWLINE        '\n'           
    9,4-9,10:           NAME           'return'       
    9,11-9,16:          NAME           'model'        
    9,16-9,17:          NEWLINE        '\n'           
    10,0-10,0:          DEDENT         ''             
    10,0-10,0:          ENDMARKER      ''
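
The same tokenizer can be called from Python instead of the command line. The following is a minimal sketch, assuming the source is already available as a string; the helper names split_tokens and split_tokens_with_types and the LAYOUT_TYPES set are made up for this illustration and are not part of the tokenize module. Note that the tokenizer is finer-grained than the split asked for in the question: model.add comes out as three tokens (NAME, DOT, NAME) and batch_size=batch_size as NAME, EQUAL, NAME, so coarser groups would have to be rejoined afterwards.

```python
import io
import tokenize

# Token types that describe layout only and carry no source text of interest.
LAYOUT_TYPES = {
    tokenize.NEWLINE,
    tokenize.NL,
    tokenize.INDENT,
    tokenize.DEDENT,
    tokenize.ENDMARKER,
}


def split_tokens(source):
    """Return the token strings found in `source`, skipping layout tokens."""
    readline = io.StringIO(source).readline
    return [
        tok.string
        for tok in tokenize.generate_tokens(readline)
        if tok.type not in LAYOUT_TYPES
    ]


def split_tokens_with_types(source):
    """Return (exact type name, string) pairs, like `python -m tokenize -e`."""
    readline = io.StringIO(source).readline
    return [
        (tokenize.tok_name[tok.exact_type], tok.string)
        for tok in tokenize.generate_tokens(readline)
        if tok.type not in LAYOUT_TYPES
    ]


sample = "model.add(Dense(1))\nmodel.compile(loss='mse', optimizer='adam')\n"
print(split_tokens(sample))
print(split_tokens_with_types("f(x)\n"))
```

To tokenize a file rather than a string, tokenize.open(path) returns a text stream with the encoding detected, and its readline method can be passed to tokenize.generate_tokens in the same way.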