Lots of questions on how to tokenize some string using regexp.
I'm looking however, to tokenize the regexp pattern itself, I'm sure there are some posts on the subject but I cannot find them.
Examples:
^\w$ -> ['^', '\w', '&']
[3-7]* -> ['[3-7]*']
\w+\s\w+ -> ['\w+', '\s', '\w+']
(xyz)*\s[a-zA-Z]+[0-9]? -> ['(xyz)*','\s','[a-zA-Z]+','[0-9]?']
I'm assuming this work is done in python under the hood when some regexp
function is called.
One place to start: the PyPy project has an implementation of Python in (mostly) Python. The re.py in the source distribution calls sre-compile.c:_compile() to do the work. You might be able to hack that to provide the output form you want.
Edit Also, the Javascript XRegExp library parses regexes in an extended syntax and renders them to standard syntax. The parser routine may help you.