
How to tokenize R code from within R


This is the code:

aggregate(results ~ school, FUN = table, data = df)

The code above is written in R. Is there a tool available in R that extracts its tokens, so that the line above becomes:

FUNC_NAME(DATA ~ DATA, PARA_FUN = DATA, PARA_DATA = DATA)

I tried minilexer to split some simplified R code into tokens, but the rules are really simple. Is there a tool that already implements all of the rules, so that I don't need to reinvent the wheel?


Solution

  • You can get the token stream produced by R's own lexer using getParseData():

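    # keep.source = TRUE is needed outside interactive sessions (e.g. Rscript),
    # where getParseData() would otherwise return NULL
    getParseData(parse(text = "aggregate(results ~ school, FUN = table, data = df)", keep.source = TRUE))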
    
    #    line1 col1 line2 col2 id parent                token terminal      text
    # 27     1    1     1   51 27      0                 expr    FALSE          
    # 1      1    1     1    9  1      3 SYMBOL_FUNCTION_CALL     TRUE aggregate
    # 3      1    1     1    9  3     27                 expr    FALSE          
    # 2      1   10     1   10  2     27                  '('     TRUE         (
    # 10     1   11     1   26 10     27                 expr    FALSE          
    # 4      1   11     1   17  4      6               SYMBOL     TRUE   results
    # 6      1   11     1   17  6     10                 expr    FALSE          
    # 5      1   19     1   19  5     10                  '~'     TRUE         ~
    # 7      1   21     1   26  7      9               SYMBOL     TRUE    school
    # 9      1   21     1   26  9     10                 expr    FALSE          
    # 8      1   27     1   27  8     27                  ','     TRUE         ,
    # 13     1   29     1   31 13     27           SYMBOL_SUB     TRUE       FUN
    # 14     1   33     1   33 14     27               EQ_SUB     TRUE         =
    # 15     1   35     1   39 15     17               SYMBOL     TRUE     table
    # 17     1   35     1   39 17     27                 expr    FALSE          
    # 16     1   40     1   40 16     27                  ','     TRUE         ,
    # 20     1   42     1   45 20     27           SYMBOL_SUB     TRUE      data
    # 21     1   47     1   47 21     27               EQ_SUB     TRUE         =
    # 22     1   49     1   50 22     24               SYMBOL     TRUE        df
    # 24     1   49     1   50 24     27                 expr    FALSE          
    # 23     1   51     1   51 23     27                  ')'     TRUE         )
    

    Internally, R's parser is generated with Bison, which is a parser generator rather than a lexer; the grammar, together with the tokenizing code, is defined in the gram.y file of the R source code. You should be able to get all the information you need from that file. It is better to rely on the built-in lexer than on a package that tries to re-implement the built-in one.
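
To get from that table to the placeholder form asked for in the question, one option is to relabel the terminal tokens and stitch them back together. The following is only a sketch under a few assumptions: `abstract_tokens` is a hypothetical helper name, the mapping (function calls to FUNC_NAME, argument names to PARA_<NAME>, other symbols and constants to DATA) is inferred from the single example in the question, and the input is assumed to be one line of code so that column positions can be used to restore the original spacing.

```r
# Hypothetical helper: the name and the placeholder scheme are assumptions
# inferred from the example in the question, not an existing API.
abstract_tokens <- function(code) {
  # keep.source = TRUE so that parse data is available in non-interactive runs
  pd <- utils::getParseData(parse(text = code, keep.source = TRUE))

  # Keep only terminal tokens (the actual lexemes), in source order
  pd <- pd[pd$terminal, ]
  pd <- pd[order(pd$line1, pd$col1), ]

  # Start from the literal text and overwrite the token classes of interest
  out <- pd$text
  out[pd$token == "SYMBOL_FUNCTION_CALL"] <- "FUNC_NAME"
  out[pd$token %in% c("SYMBOL", "NUM_CONST", "STR_CONST")] <- "DATA"
  is_arg <- pd$token == "SYMBOL_SUB"
  out[is_arg] <- paste0("PARA_", toupper(pd$text[is_arg]))

  # Restore the original spacing from column positions
  # (assumes single-line input; pmax() guards against negative gaps)
  gap <- c(0, pd$col1[-1] - pd$col2[-nrow(pd)] - 1)
  paste0(strrep(" ", pmax(gap, 0)), out, collapse = "")
}

abstract_tokens("aggregate(results ~ school, FUN = table, data = df)")
# [1] "FUNC_NAME(DATA ~ DATA, PARA_FUN = DATA, PARA_DATA = DATA)"
```

Because the relabeling keys off the `token` column, other token classes reported by getParseData() can be mapped in the same way by adding further assignments; multi-line input would need the line numbers to be taken into account as well.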