python, regex, lex, ply

ply lexmatch regular expression has different groups than a usual re


I am using ply and have noticed a strange discrepancy between the token's re match object stored in t.lexer.lexmatch and a match object produced by a pattern compiled in the usual way with the re module: the group numbers seem to be off by one.

I have defined a simple lexer to illustrate the behavior I am seeing:

import ply.lex as lex

tokens = ('CHAR',)

def t_CHAR(t):
    r'.'
    # Stash the underlying SRE match object on the token so it can be inspected.
    t.value = t.lexer.lexmatch
    return t

l = lex.lex()

(I get a warning about t_error but ignore it for now.) Now I feed some input into the lexer and get a token:

l.input('hello')
l.token()

I get a LexToken(CHAR,<_sre.SRE_Match object at 0x100fb1eb8>,1,0). I want to look at the match object:

m = _.value

So now I look at the groups:

m.group() => 'h' as I expect.

m.group(0) => 'h' as I expect.

m.group(1) => 'h', even though I would not expect such a group to exist.

Compare this to creating such a regular expression manually:

import re
p = re.compile(r'.')
m2 = p.match('hello')

This gives different groups:

m2.group() => 'h' as I expect.

m2.group(0) => 'h' as I expect.

m2.group(1) => IndexError: no such group, as I expect.
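
For what it's worth, inspecting the compiled patterns attached to the two match objects (standard re attributes) shows that the PLY one really does contain a capturing group; the exact values in the comments are just what I see with this one-rule lexer:

print(m.re.groups)   # 1  -- the pattern PLY compiled wraps my rule in a group
print(m.re.pattern)  # something like '(?P<t_CHAR>.)', not the bare '.'
print(p.groups)      # 0  -- my hand-compiled pattern has no groups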

Does anyone know why this discrepancy exists?


Solution

  • In version 3.4 of PLY, this occurs because of how the token rules' docstrings are converted into the patterns PLY compiles.

    Looking at the source really does help - line 746 of lex.py:

    c = re.compile("(?P<%s>%s)" % (fname,f.__doc__), re.VERBOSE | self.reflags)
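
    The effect is easy to reproduce with plain re by wrapping the rule's pattern in a named group the same way; the group name t_CHAR below mirrors the rule's function name, but the exact name PLY uses is an internal detail:

    import re

    wrapped = re.compile("(?P<t_CHAR>%s)" % r'.', re.VERBOSE)
    m = wrapped.match('hello')
    m.group(0)         # 'h'
    m.group(1)         # 'h' -- the extra group comes from the wrapper
    m.group('t_CHAR')  # 'h'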
    

    I wouldn't recommend relying on something like this between versions; it's just part of the magic of how PLY works.
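
    If all you actually need is the matched text, the token already carries it, so a rule along these lines (a minimal sketch) avoids depending on lexmatch and its group numbering entirely:

    def t_CHAR(t):
        r'.'
        # t.value already holds the matched text ('h', 'e', ...) by default,
        # so there is no need to reach into t.lexer.lexmatch at all.
        return t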