python, string, tokenize

Tokenize on spaces and punctuation, keeping the punctuation


I'm looking for a way to tokenize (split) a string on whitespace and punctuation. The punctuation must be kept in the result; the whitespace must not. The tokens will be used to recognize the language of a snippet (Python, Java, HTML, C, ...).

The input string could be:

class Foldermanagement():
def __init__(self):
    self.today = invoicemng.gettoday()
    ...

The output I'm expecting is a list of tokens like this:

['class', 'Foldermanagement', '(', ')', ':', 'def', '_', '_', 'init', ... ,'self', '.', 'today', '=', ...]

Any solution is welcome, thanks in advance.


Solution

  • I think this is what you are looking for:

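    import itertools
    import re
    import string

    text = """
    class Foldermanagement():
    def __init__(self):
        self.today = invoicemng.gettoday()
           """
    # Every punctuation or whitespace character is a one-character separator.
    separators = string.punctuation + string.whitespace
    separators_re = "|".join(re.escape(x) for x in separators)
    # re.split() returns one more item than re.findall(), so pad with
    # zip_longest to keep the last word when the text does not end
    # with a separator (plain zip would silently drop it).
    pieces = re.split(separators_re, text)
    found = re.findall(separators_re, text)
    tokens = itertools.zip_longest(pieces, found, fillvalue="")
    flattened = itertools.chain.from_iterable(tokens)
    # Drop empty strings and whitespace; punctuation is kept.
    cleaned = [x for x in flattened if x and not x.isspace()]
    # ['class', 'Foldermanagement', '(', ')', ':', 'def', '_', '_',
    #  'init', '_', '_', '(', 'self', ')', ':', 'self', '.', 'today', '=',
    #  'invoicemng', '.', 'gettoday', '(', ')']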
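
If it helps, the same kind of token list can probably be produced with a single `re.findall` pattern instead of interleaving `split` and `findall`. This is only a sketch, not the approach from the answer above: the names `TOKEN_RE` and `tokenize` are made up for illustration, and it assumes that `_` should count as punctuation (as in the expected output) and that any non-word, non-space character is a punctuation token, which is slightly broader than `string.punctuation` because it also covers non-ASCII symbols.

```python
import re

# Hypothetical helper, one alternative per token kind:
#   [^\W_]+   a run of letters/digits (word characters except "_")
#   [^\w\s]   a single punctuation/symbol character
#   _         a single underscore, treated as punctuation
TOKEN_RE = re.compile(r"[^\W_]+|[^\w\s]|_")


def tokenize(text):
    """Return words and single punctuation characters; whitespace is dropped."""
    return TOKEN_RE.findall(text)


sample = "def __init__(self):\n    self.today = invoicemng.gettoday()"
print(tokenize(sample))
```

Because `findall` only returns what the pattern matches, whitespace never shows up in the result and there is no need to filter out empty strings afterwards.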