This is the code:
aggregate(results ~ school, FUN = table, data = df)
The above code is written in R. Is there a tool available in R that extracts the tokens, so that the above becomes:
FUNC_NAME(DATA ~ DATA, PARA_FUN = DATA, PARA_DATA = DATA)
I tried minilexer to split some simplified R code into tokens, but its rules are really simple. I'm wondering whether there is a tool that has already implemented all of the rules, so I don't need to reinvent the wheel.
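You can get the tokens produced by R's own lexer and parser using getParseData():
# keep.source = TRUE is needed when running non-interactively (e.g. via Rscript),
# where it defaults to FALSE and getParseData() would return NULL
getParseData(parse(text = "aggregate(results ~ school, FUN = table, data = df)", keep.source = TRUE))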
#    line1 col1 line2 col2 id parent                token terminal      text
# 27     1    1     1   51 27      0                 expr    FALSE
# 1      1    1     1    9  1      3 SYMBOL_FUNCTION_CALL     TRUE aggregate
# 3      1    1     1    9  3     27                 expr    FALSE
# 2      1   10     1   10  2     27                  '('     TRUE         (
# 10     1   11     1   26 10     27                 expr    FALSE
# 4      1   11     1   17  4      6               SYMBOL     TRUE   results
# 6      1   11     1   17  6     10                 expr    FALSE
# 5      1   19     1   19  5     10                  '~'     TRUE         ~
# 7      1   21     1   26  7      9               SYMBOL     TRUE    school
# 9      1   21     1   26  9     10                 expr    FALSE
# 8      1   27     1   27  8     27                  ','     TRUE         ,
# 13     1   29     1   31 13     27           SYMBOL_SUB     TRUE       FUN
# 14     1   33     1   33 14     27               EQ_SUB     TRUE         =
# 15     1   35     1   39 15     17               SYMBOL     TRUE     table
# 17     1   35     1   39 17     27                 expr    FALSE
# 16     1   40     1   40 16     27                  ','     TRUE         ,
# 20     1   42     1   45 20     27           SYMBOL_SUB     TRUE      data
# 21     1   47     1   47 21     27               EQ_SUB     TRUE         =
# 22     1   49     1   50 22     24               SYMBOL     TRUE        df
# 24     1   49     1   50 24     27                 expr    FALSE
# 23     1   51     1   51 23     27                  ')'     TRUE         )
Internally, it looks like R's parser is generated with Bison. The grammar, together with the lexer that feeds it, is defined in the gram.y file of the R source code, so you should be able to get all the information you need from that. It's better to rely on the built-in lexer than to have a package try to re-implement the built-in one.
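To get from the parse data to the placeholder form asked for in the question, one option is to keep only the terminal rows and relabel them by token type. The following is a minimal sketch, not a complete solution. The helper name abstract_tokens and the label mapping (SYMBOL_FUNCTION_CALL to FUNC_NAME, SYMBOL to DATA, SYMBOL_SUB to PARA_ plus the upper-cased argument name) are assumptions made for illustration, and it only handles single-line input.

```r
# Hypothetical helper (not part of base R): relabel the terminal tokens
# of a single-line R expression using the built-in parser's output.
abstract_tokens <- function(code) {
  pd <- getParseData(parse(text = code, keep.source = TRUE))

  # Keep only terminal tokens (drop the interior 'expr' nodes), in source order
  tok <- pd[pd$terminal, ]
  tok <- tok[order(tok$line1, tok$col1), ]

  # Start from the original text and overwrite the token types of interest
  label   <- tok$text
  is_call <- tok$token == "SYMBOL_FUNCTION_CALL"
  is_sym  <- tok$token == "SYMBOL"
  is_arg  <- tok$token == "SYMBOL_SUB"
  label[is_call] <- "FUNC_NAME"
  label[is_sym]  <- "DATA"
  label[is_arg]  <- paste0("PARA_", toupper(tok$text[is_arg]))

  # Recreate the original spacing from column positions
  # (assumes all tokens are on one line)
  gap <- c(0, tok$col1[-1] - tok$col2[-nrow(tok)] - 1)
  paste0(strrep(" ", gap), label, collapse = "")
}

abstract_tokens("aggregate(results ~ school, FUN = table, data = df)")
# [1] "FUNC_NAME(DATA ~ DATA, PARA_FUN = DATA, PARA_DATA = DATA)"
```

Other token types (operators, constants, comments) are passed through unchanged here; extending the mapping is a matter of adding more cases keyed on the token column.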