I'm looking for a way to tokenize a string by splitting it on whitespace and punctuation. Only the punctuation should be kept in the result; the whitespace should be discarded. It will be used to recognize the language of a snippet (Python, Java, HTML, C, ...).
The input string could be:
class Foldermanagement():
    def __init__(self):
        self.today = invoicemng.gettoday()
...
The output I'm expecting is a list of tokens like the one below:
['class', 'Foldermanagement', '(', ')', ':', 'def', '_', '_', 'init', ... ,'self', '.', 'today', '=', ...]
Any solution is welcome, thanks in advance.
I think this is what you are looking for:
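import itertools
import re
import string

text = """
class Foldermanagement():
    def __init__(self):
        self.today = invoicemng.gettoday()
"""

# Every punctuation and whitespace character is a one-character separator.
separators = string.punctuation + string.whitespace
separators_re = "|".join(re.escape(x) for x in separators)

# re.split() gives the text between separators, re.findall() the separators;
# interleaving the two restores the original order. zip_longest (rather than
# zip) keeps the last piece when the text does not end with a separator.
tokens = itertools.zip_longest(
    re.split(separators_re, text), re.findall(separators_re, text), fillvalue=""
)
flattened = itertools.chain.from_iterable(tokens)

# Drop empty strings and whitespace separators; punctuation is kept.
cleaned = [x for x in flattened if x and not x.isspace()]
# ['class', 'Foldermanagement', '(', ')', ':', 'def', '_', '_',
# 'init', '_', '_', '(', 'self', ')', ':', 'self', '.', 'today', '=',
# 'invoicemng', '.', 'gettoday', '(', ')']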
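A shorter variant is possible with a single re.findall call. This is only a sketch, not part of the code above: the name tokenize and the pattern are my own choices. The pattern matches either a run of letters and digits, or one non-space symbol. The underscore is handled separately because \w counts it as a word character, while the expected output treats it as punctuation.

```python
import re

# Hypothetical helper, not from the original answer.
# [^\W_]+  : one or more word characters, excluding the underscore
# [^\w\s]  : a single character that is neither a word character nor whitespace
# _        : the underscore, emitted as a token of its own
TOKEN_RE = re.compile(r"[^\W_]+|[^\w\s]|_")


def tokenize(text):
    """Return words and single punctuation characters; whitespace is dropped."""
    return TOKEN_RE.findall(text)


print(tokenize("def __init__(self):"))
# ['def', '_', '_', 'init', '_', '_', '(', 'self', ')', ':']
```

One difference to keep in mind: [^\w\s] also matches symbols outside string.punctuation (for example non-ASCII symbols), so the two approaches can differ on input that is not plain ASCII.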