I am trying to capture names (not starting with a number) which could contain dashes, such as "hello-world". My problem is that I also have rules for single dashes and symbols which conflict with it:
[A-Za-z][A-Za-z0-9-]+ { /* capture "hello-world" */ }
"-" { return '-'; }
">" { return '>'; }
When the lexer reads "hello-world->", the previous rules yield "hello-world-" and ">", whereas I expected "hello-world", "-" and ">" to be captured individually. To solve it I fixed it this way:
[A-Za-z][A-Za-z0-9-]*[A-Za-z0-9]+ { /* ensure final dash is never included at the end */ }
That works, except for single-letter words, so finally I implemented this:
[A-Za-z][A-Za-z0-9-]*[A-Za-z0-9]+ { /* ensure final dash is never included at the end */ }
[A-Za-z][A-Za-z0-9]* { /* capture possible single letter words */ }
Question: Is there a more elegant way to do it?
[A-Za-z][A-Za-z0-9-]*[A-Za-z0-9]+
[A-Za-z][A-Za-z0-9]*
Note that, as you said, the first rule already covers everything that's not a single letter. So the second rule only has to match single letters and can be shortened to just [A-Za-z]:
[A-Za-z][A-Za-z0-9-]*[A-Za-z0-9]+
[A-Za-z]
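As a sanity check on that claim, here is a minimal sketch that brute-forces short strings and compares the original pair of rules with the shortened pair. It uses Python's re module as a stand-in for the lexer's regex engine (an assumption; flex itself is not involved), and the names ORIGINAL, SHORTENED, all_strings and accepts are made up for illustration.

```python
import re
from itertools import product

# Original pair of rules from the question, combined as one alternation.
ORIGINAL = re.compile(r"[A-Za-z][A-Za-z0-9-]*[A-Za-z0-9]+|[A-Za-z][A-Za-z0-9]*")
# Same first rule, second rule shortened to a single letter.
SHORTENED = re.compile(r"[A-Za-z][A-Za-z0-9-]*[A-Za-z0-9]+|[A-Za-z]")

# Assumption: one representative per character class is enough here
# ('a' for any letter, '1' for any digit, '-' for the dash).
ALPHABET = "a1-"

def all_strings(max_len):
    """Yield every non-empty string over ALPHABET up to max_len characters."""
    for n in range(1, max_len + 1):
        for chars in product(ALPHABET, repeat=n):
            yield "".join(chars)

def accepts(pattern, s):
    """True if the pattern matches the whole string."""
    return pattern.fullmatch(s) is not None

# Collect every string on which the two rule sets disagree.
mismatches = [s for s in all_strings(6)
              if accepts(ORIGINAL, s) != accepts(SHORTENED, s)]
print(mismatches)  # prints []
```

An empty list only shows agreement on strings of up to six characters over this reduced alphabet, so it supports the argument rather than proving it.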
Now the second rule is a mere prefix of the first, so we can combine this into a single rule by making the part after the first letter optional:
[A-Za-z]([A-Za-z0-9-]*[A-Za-z0-9]+)?
The + on the last bit is unnecessary, because everything except the last character can just as well be matched by the middle part, so the simplest version is:
[A-Za-z]([A-Za-z0-9-]*[A-Za-z0-9])?
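To see the final pattern in action, the following is a rough sketch of a longest-match tokenizer written in Python rather than flex. The RULES table, the token names NAME, DASH and GT, and the tokenize function are all hypothetical, and Python's backtracking re engine is only an approximation of how a generated lexer picks the longest match.

```python
import re

# Hypothetical rule table mirroring the lexer rules: (token name, pattern).
RULES = [
    ("NAME", re.compile(r"[A-Za-z]([A-Za-z0-9-]*[A-Za-z0-9])?")),
    ("DASH", re.compile(r"-")),
    ("GT", re.compile(r">")),
]

def tokenize(text):
    """Split text into (name, lexeme) pairs, preferring the longest match."""
    tokens = []
    pos = 0
    while pos < len(text):
        best = None
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Keep the longest match; on ties the earlier rule wins.
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        tokens.append((best[0], best[1].group()))
        pos = best[1].end()
    return tokens

print(tokenize("hello-world->"))
# prints [('NAME', 'hello-world'), ('DASH', '-'), ('GT', '>')]
```

With this pattern the trailing dash is left for the "-" rule, and a single-letter input such as "a" still comes out as one NAME token.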