parsing, flex-lexer

Capture names that contain, but do not end in, dashes


I am trying to capture names (not starting with a digit) that may contain dashes, such as hello-world. My problem is that I also have rules for single dashes and other symbols, and they conflict with the name rule:

[A-Za-z][A-Za-z0-9-]+     { /* capture "hello-world" */ }
"-"                       { return '-'; }
">"                       { return '>'; }

When the lexer reads hello-world->, the previous rules yield hello-world- and >, whereas I expected hello-world, - and > to be captured individually. To fix this, I changed the name rule as follows:

[A-Za-z][A-Za-z0-9-]*[A-Za-z0-9]+     { /* ensure final dash is never included at the end */ }

That works, except for single-letter words, so finally I implemented this:

[A-Za-z][A-Za-z0-9-]*[A-Za-z0-9]+     { /* ensure final dash is never included at the end */ }
[A-Za-z][A-Za-z0-9]*                  { /* capture possible single letter words */ }

Question: Is there a more elegant way to do it?


Solution

  • [A-Za-z][A-Za-z0-9-]*[A-Za-z0-9]+
    [A-Za-z][A-Za-z0-9]*
    

    Note that, as you said, the first rule already covers everything that's not a single letter. So the second rule only has to match single letters and can be shortened to just [A-Za-z]:

    [A-Za-z][A-Za-z0-9-]*[A-Za-z0-9]+
    [A-Za-z]
    

    Now the second rule is a mere prefix of the first, so we can combine the two into a single rule by making the part after the first letter optional:

    [A-Za-z]([A-Za-z0-9-]*[A-Za-z0-9]+)?
    

    The + on the last part is unnecessary: in a run of trailing alphanumerics, everything except the final character can just as well be matched by the middle part, so the simplest version is:

    [A-Za-z]([A-Za-z0-9-]*[A-Za-z0-9])?
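
One way to gain some confidence in these simplifications is to brute-force compare the patterns on all short strings, then run the final rule through a rough simulation of longest-match lexing. The sketch below is in Python rather than flex, so it only approximates what the lexer does: `patterns_agree`, `tokenize`, the token names, the test alphabet and the length bound are all made up for illustration, and it assumes Python's greedy matching finds the longest match for these particular patterns (it does for simple character-class patterns like these, but not for regexes in general).

```python
import itertools
import re

# The formulations discussed above, written as Python regexes.
TWO_RULES = re.compile(r"[A-Za-z][A-Za-z0-9-]*[A-Za-z0-9]+|[A-Za-z][A-Za-z0-9]*")
WITH_PLUS = re.compile(r"[A-Za-z]([A-Za-z0-9-]*[A-Za-z0-9]+)?")
SIMPLEST = re.compile(r"[A-Za-z]([A-Za-z0-9-]*[A-Za-z0-9])?")


def patterns_agree(alphabet="a1-", max_len=6):
    """Return True if all three patterns accept exactly the same strings
    over `alphabet`, for every length up to `max_len`."""
    for n in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=n):
            s = "".join(chars)
            verdicts = {bool(p.fullmatch(s)) for p in (TWO_RULES, WITH_PLUS, SIMPLEST)}
            if len(verdicts) != 1:
                return False
    return True


# Hypothetical stand-ins for the lexer rules, in rule order.
RULES = [("NAME", SIMPLEST), ("DASH", re.compile(r"-")), ("GT", re.compile(r">"))]


def tokenize(text):
    """Pick the longest match at each position; ties go to the earlier rule."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens


print(patterns_agree())           # True
print(tokenize("hello-world->"))  # [('NAME', 'hello-world'), ('DASH', '-'), ('GT', '>')]
print(tokenize("a-"))             # [('NAME', 'a'), ('DASH', '-')]
```

The exhaustive comparison only covers strings up to six characters over one letter, one digit and the dash, so it is a sanity check rather than a proof.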