
Regular expression in Flex does not match any identifier


I am new to Flex and I want to write a scanner with it.

At this stage, I want to write a regular expression that matches an id, subject to these conditions:

  1. An id may contain underscores.

  2. Underscores may be used freely, but no more than 2 may appear in a row, for example:

    a_b_c »»»» true

    a___b »»»» false

    123abv »»»» false

  3. An id can't begin with a digit.

  4. An id can't end with an underscore.

The regular expression I have written for this is:

(\b(_{0,2}[A-Za-z][0-9A-Za-z]*(_{0,2}[0-9A-Za-z]+)*)\b)

Now I have 2 questions:

  1. Is the regular expression correct? I tested it on rubular.com and I think it is, but I'm not sure.

  2. More importantly, when I put it in my Flex file, no id is recognized, and I can't see why.

Can anyone please help me?


Solution

  • The problem here is your ID regular expression. You are using \b to match a word boundary, but Flex's regular expressions have no built-in support for word boundaries. In a Flex pattern, \b is the C escape for a backspace character, so your rule only matches input that contains literal backspaces, which is why no id is recognized. Other than that, your regular expression is sound. I was able to get your code working using this modified version: _{0,2}[A-Za-z][0-9A-Za-z]*(_{0,2}[0-9A-Za-z]+)*. (I just got rid of the \b's and some of the redundant parentheses.)
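To check the modified pattern outside Flex, here is a minimal sketch in C using POSIX regex.h (so it assumes a POSIX system). The function name is_valid_id is made up for illustration, and the ^ and $ anchors are added only so that this standalone check has to match the whole string; a Flex rule would not use them.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* The modified pattern from the answer, anchored with ^ and $ so the
   whole string must match. The anchors are an addition for this
   standalone check only. */
static const char *ID_PATTERN =
    "^_{0,2}[A-Za-z][0-9A-Za-z]*(_{0,2}[0-9A-Za-z]+)*$";

/* Hypothetical helper: returns 1 if s is a complete id under the
   pattern, 0 otherwise (including when the pattern fails to compile). */
int is_valid_id(const char *s)
{
    regex_t re;
    int rc;

    assert(s != NULL);
    if (regcomp(&re, ID_PATTERN, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With this check, a_b_c and a__b are accepted, while a___b, 123abv and abc_ are rejected. The pattern also accepts leading underscores such as __a, which the stated rules do not forbid.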

    Unfortunately, dropping \b causes a slight problem. Say that you're lexing and run across something like 12_345. Flex will read 12 and match it as an IC (your integer token), then read _. No rule matches it, so Flex's default rule echoes it to stdout, and then 345 is read as another IC.

    In order to avoid this issue (caused by Flex's lack of word boundaries), you could do one of two things:

    • Create a rule at the end that matches any character (other than whitespace) and make it give an error. This would stop Flex when it got to _ in the example above.
    • Create a rule at the end that matches any combination of letters, numbers, and underscores ([_0-9A-Za-z]+), and make it give an error. Flex picks the longest match and, on a tie, the rule listed first, so valid IDs and ICs still reach their own rules, while 12_345 in the example above is returned whole as an error.
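The effect of the second option can be sketched in plain C. This is only a simulation of how Flex chooses a rule (longest match, earlier rule on a tie), not Flex itself. It assumes POSIX regex.h, assumes the IC rule is [0-9]+ (the original scanner is not shown), and the names next_token and TOK_* are made up for illustration.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

enum token { TOK_NONE, TOK_ID, TOK_IC, TOK_BAD };

/* Rules in priority order. Each pattern is anchored with ^ so it can
   only match at the current input position. */
static const struct { enum token kind; const char *pattern; } RULES[] = {
    { TOK_ID,  "^_{0,2}[A-Za-z][0-9A-Za-z]*(_{0,2}[0-9A-Za-z]+)*" },
    { TOK_IC,  "^[0-9]+" },         /* assumed integer rule */
    { TOK_BAD, "^[_0-9A-Za-z]+" },  /* catch-all error rule, listed last */
};

/* Finds the rule with the longest match at the start of s. Stores the
   match length in *len and returns that rule's token kind, or TOK_NONE
   with *len == 0 if no rule matches. */
enum token next_token(const char *s, size_t *len)
{
    enum token best = TOK_NONE;
    size_t i;

    assert(s != NULL && len != NULL);
    *len = 0;
    for (i = 0; i < sizeof RULES / sizeof RULES[0]; i++) {
        regex_t re;
        regmatch_t m;

        if (regcomp(&re, RULES[i].pattern, REG_EXTENDED) != 0)
            continue;
        /* Strictly longer only, so an earlier rule wins a tie. */
        if (regexec(&re, s, 1, &m, 0) == 0 && (size_t)m.rm_eo > *len) {
            *len = (size_t)m.rm_eo;
            best = RULES[i].kind;
        }
        regfree(&re);
    }
    return best;
}
```

For 12_345 the catch-all rule wins with a 6-character match, so the whole token is reported as an error, while abc and 12 still come back as an ID and an IC because their rules are listed first. Removing the TOK_BAD row reproduces the original problem: the first token for 12_345 is then an IC of length 2.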

    One other problem: the ID regular expression still won't match anything with underscores at the end of it. Given input such as abc_, it matches only abc and leaves the trailing _ for some other rule, so the input is split rather than rejected as a whole. This means your current regular expression isn't perfect, and you'll need to do some tweaking with it, but now you know not to use \b. The Patterns section of the Flex manual lists the supported regular expression syntax, so you can find other things to use or avoid.