Tags: python, regex, findall

Why does re.findall return a list of tuples when my pattern only contains one group?


Say I have a string s containing letters and two delimiters 1 and 2. I want to split the string in the following way:

  • if a substring t falls between 1 and 2, return t
  • otherwise, return each character

So if s = 'ab1cd2efg1hij2k', the expected output is ['a', 'b', 'cd', 'e', 'f', 'g', 'hij', 'k'].

I tried to use regular expressions:

import re
s = 'ab1cd2efg1hij2k'
re.findall( r'(1([a-z]+)2|[a-z])', s )

[('a', ''),
 ('b', ''),
 ('1cd2', 'cd'),
 ('e', ''),
 ('f', ''),
 ('g', ''),
 ('1hij2', 'hij'),
 ('k', '')]

From there I can do [ x[x[-1]!=''] for x in re.findall( r'(1([a-z]+)2|[a-z])', s ) ] to get my answer, but I still don't understand the output. The documentation says that findall returns a list of tuples if the pattern has more than one group. However, as far as I can tell, my pattern only contains one group. Any explanation is welcome.


Solution

  • Your pattern has two capturing groups. The first is the outer group:

    (1([a-z]+)2|[a-z])
    

    and the second is the smaller group nested inside the first:

    ([a-z]+)
    

    Here is a solution that gives the expected result. It adds a third group for the single-letter case and then keeps the last non-empty group of each match. It is admittedly ugly, and there is probably a better way that I have not found:

    import re
    s = 'ab1cd2efg1hij2k'
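    # Three capturing groups: the whole match, the delimited letters, the single letter.
    a = re.findall( r'((?:1)([a-z]+)(?:2)|([a-z]))', s )
    # Drop the empty groups from each tuple and keep the last non-empty one.
    a = [tuple(j for j in i if j)[-1] for i in a]
    
    >>> print(a)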
    ['a', 'b', 'cd', 'e', 'f', 'g', 'hij', 'k']
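
A possibly tidier variant, offered as a sketch rather than the definitive fix: the outer parentheses are not needed for the alternation, so dropping them leaves a single capturing group, and re.finditer gives access to both that group and the whole match. The names original, pattern, group_count and result are only illustrative. The compiled pattern's groups attribute is used to confirm how many capturing groups the original pattern really has.

```python
import re

s = 'ab1cd2efg1hij2k'

# Every "(" that is not followed by "?" opens a capturing group,
# so the pattern from the question has two of them.
original = re.compile(r'(1([a-z]+)2|[a-z])')
group_count = original.groups

# Keep only the inner group; the top-level alternation needs no parentheses.
pattern = re.compile(r'1([a-z]+)2|[a-z]')

# group(1) is None when the single-letter branch matched,
# so fall back to the whole match, group(0).
result = [m.group(1) or m.group(0) for m in pattern.finditer(s)]

print(group_count)
print(result)
```

This relies on [a-z]+ never matching an empty string, so a falsy group(1) always means the single-letter branch was taken.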