I am struggling to find a Python library or script to tokenize source code (that is, to find specific tokens such as function definition names, variable names, keywords, etc.).
I have managed to find keywords, whitespace, etc. using something like this, but function and class definition names have proved quite a challenge. I was hoping to use an existing script; I explored Pygments with no success. Its lexer seems ideal for what I want, but I have no idea how to use it from Python or how to get the position of each token it finds.
For example, I am looking to do something like this:
int fac(int n)
{
    return (n>1) ? n*fac(n-1) : 1;
}
From the source code above I would like to get:
function_name: 'fac' at position (x, y)
variable_name: 'n' at position (x, y+8)
EDITED: Any suggestions will be appreciated, since I am in the dark here regarding tokenization and parsing of C++.
Eli Bendersky is a smart guy, and sometimes active here on SO. He's got a blog post on this issue which I'll refer you directly to: Parsing C++ in Python with Clang.
Because things disappear, here's the takeaway:
Eli Bendersky wrote a C language (not C++) parser in Python, called pycparser. People keep asking him if he's going to add support for C++. He is not. He recommends instead that people use the Python bindings for libclang to get access to "a C API that the Clang team vows to keep relatively stable, allowing the user to examine parsed code at the level of an abstract syntax tree (AST)".
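As a rough sketch of that libclang approach, the snippet below walks the AST of a source string and reports function and variable declarations with their line and column. The helper names `format_hit` and `find_names`, the file name `example.cpp`, and the `-std=c++11` flag are my own choices for illustration, not something taken from the blog post, and it assumes the `clang` Python bindings plus a matching libclang shared library are available.

```python
def format_hit(kind, name, line, column):
    """Render one finding as "kind: 'name' at position (line, column)"."""
    return f"{kind}: {name!r} at position ({line}, {column})"


def find_names(code, filename="example.cpp"):
    """Return formatted hits for function, parameter and variable declarations."""
    # Imported lazily: this needs the clang Python bindings and libclang itself.
    from clang.cindex import CursorKind, Index

    labels = {
        CursorKind.FUNCTION_DECL: "function_name",
        CursorKind.PARM_DECL: "variable_name",
        CursorKind.VAR_DECL: "variable_name",
    }
    # unsaved_files lets libclang parse an in-memory string under a made-up name.
    tu = Index.create().parse(
        filename, args=["-std=c++11"], unsaved_files=[(filename, code)]
    )
    hits = []
    for cursor in tu.cursor.walk_preorder():
        label = labels.get(cursor.kind)
        loc = cursor.location
        # Keep only declarations that live in our own (virtual) file.
        if label and loc.file and loc.file.name == filename:
            hits.append(format_hit(label, cursor.spelling, loc.line, loc.column))
    return hits
```

Calling `print("\n".join(find_names(source)))` with the question's code as `source` should list `fac` and `n` with 1-based line and column numbers. If the bindings cannot locate the shared library on their own, `clang.cindex.Config.set_library_file(...)` can point them at it before the first `Index.create()`.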
You can find the bindings separately on PyPI here. Note, though, that you'll need to have clang installed anyway, so you may just want to point your PYTHONPATH directly at the install location.
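If installing clang is not an option and a plain token stream is enough, the Pygments lexer mentioned in the question can also be driven from Python. This is only a sketch: `line_starts`, `to_line_col` and `name_tokens` are names I made up, and since Pygments is a regex-based highlighter rather than a parser, how it classifies an identifier (plain name versus function name) is heuristic and may vary between versions.

```python
from bisect import bisect_right

SOURCE = """int fac(int n)
{
    return (n>1) ? n*fac(n-1) : 1;
}
"""


def line_starts(text):
    """Return the character offsets at which each line of `text` begins."""
    starts = [0]
    for i, ch in enumerate(text):
        if ch == "\n":
            starts.append(i + 1)
    return starts


def to_line_col(starts, offset):
    """Convert a 0-based character offset into a 1-based (line, column)."""
    line = bisect_right(starts, offset)  # number of line starts <= offset
    return line, offset - starts[line - 1] + 1


def name_tokens(code):
    """Yield (token_type, text, line, column) for every identifier-like token."""
    # Imported lazily so the helpers above work without Pygments installed.
    from pygments.lexers import CppLexer
    from pygments.token import Token

    starts = line_starts(code)
    # get_tokens_unprocessed yields (offset, token_type, text) triples.
    for offset, ttype, text in CppLexer().get_tokens_unprocessed(code):
        if ttype in Token.Name:  # true for Name and all of its subtypes
            yield (ttype, text) + to_line_col(starts, offset)
```

Looping over `name_tokens(SOURCE)` and printing each tuple gives every identifier with its position; keywords such as `int` and `return` carry `Token.Keyword` types instead and can be collected the same way by changing the filter.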