python regex tokenize

Capturing repeating sub-patterns with permutations in Python regex


I am trying to tokenize a string made of sub-patterns that can appear in any order. Each sub-pattern is a single underscore, a run of letters, or a run of digits. For example:

   'ABC_123_DEF_456' should produce ('ABC', '_', '123', '_', 'DEF', '_', '456')

Here is the regex I tried, which gives an unexpected result:

>>> import regex  # third-party module, not the standard-library re
>>> m = regex.match(r'^((_)|(\d+)|([[:alpha:]]+))+$', 'ABC_123_DEF_456')
>>> m.groups()
('456', '_', '456', 'DEF')

Update: the three sub-patterns can appear in any order, for example:

'ABC123__' should produce ('ABC', '123', '_', '_')

Solution

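  • A repeated capture group keeps only its last match, which is why m.groups() returns four values instead of seven tokens. Instead, you can use /([a-z]+|\d+|_)/i with re.findall to split the string into runs of letters, runs of digits, or single underscores: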

    >>> import re
    >>> re.findall(r"([a-z]+|\d+|_)", "ABC_123_DEF_456", re.I)
    ['ABC', '_', '123', '_', 'DEF', '_', '456']
    >>> re.findall(r"([a-z]+|\d+|_)", "ABC123__", re.I)
    ['ABC', '123', '_', '_']