
What is the regular expression in LEX to match a string NOT starting with newline


I want to know the lex regular expression that matches a string unless that string is at the start of a line and consists of optional white space followed by "a=". I am trying to parse a language with the following types of lines:

     a=some value
     b=some value

The strings "a=" (b=, etc.) can be preceded by white space and are followed, with no white space after the =, by another string that runs up to the newline. For example:

     a=123 abcde

Here "123 abcde" is the value. The problem is that I may encounter, at least in theory, the following:

     a=123 a= 

Or worse:

     a=a=

Here the first a= is the key and the second a= is part of the value, not a key. How do I distinguish the first a= token from the second?

I can match the key "a=" with the following which handles leading whitespace:

    ^[ \r\t]*"a="  

But how do I match the second string? I need a regular expression that matches a string running up to the newline character, but does NOT match the key itself (the start of the line, optional white space, then a=). The main trick is to keep the expression from also matching the attribute a=.


Solution

  • Use a start condition to create a different lexical context for the input after the =.

    Lex works best with a language in which tokenisation is not context-dependent (most programming languages but few ad hoc interchange formats). But start conditions are manageable if you don't have too many contexts to juggle.

    See the manual for details and examples.

    Simple example:

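    %x RHS SKIP
    %%
    [[:space:]]+  ; /* Ignore leading white space and blank lines */
    a=            { BEGIN(RHS); return TOKEN_A; }
    b=            { BEGIN(RHS); return TOKEN_B; }
    .             { BEGIN(SKIP); } /* Not a key: skip the line. Should do something else */
    <SKIP>.*      ; /* Kept out of INITIAL: longest match wins, so .* there would beat a= */
    <SKIP>\n      { BEGIN(INITIAL); }
    <RHS>.+       { yylval = strdup(yytext); return VALUE; } /* Assumes YYSTYPE is char* */
    <RHS>\n       { BEGIN(INITIAL); }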
    

    Note: The RHS rules return no token when the value is empty (for example, a= followed directly by a newline). That shouldn't be a problem for a parser, but if it is, you can fix it reasonably easily.
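
    To make the idea of two lexical contexts concrete, here is a minimal plain C sketch of the same logic, written by hand rather than generated by lex. It is only an illustration: `struct kv` and `split_line` are hypothetical names, and it assumes the keys are exactly a= and b= as in the question. Once the key has been seen at the start of the line, the rest of the line is taken verbatim as the value, so a later a= is never treated as a key.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type, not part of the lex example. */
struct kv {
    char key;          /* 'a' or 'b'; 0 if the line does not start with a key */
    const char *value; /* points into the line, just after the first '=' */
    size_t value_len;  /* length of the value, excluding the newline */
};

/* Split one line the way the two contexts do:
 * INITIAL: skip leading white space, then expect "a=" or "b=".
 * RHS:     everything up to the newline is the value. */
struct kv split_line(const char *line)
{
    struct kv out = { 0, NULL, 0 };
    assert(line != NULL);

    /* INITIAL: ignore leading blanks, tabs and carriage returns. */
    while (*line == ' ' || *line == '\t' || *line == '\r')
        line++;

    /* INITIAL: only a key at this position counts as a key. */
    if ((line[0] == 'a' || line[0] == 'b') && line[1] == '=') {
        out.key = line[0];
        /* RHS: the rest of the line is the value, even if it contains "a=". */
        out.value = line + 2;
        out.value_len = strcspn(out.value, "\n");
    }
    return out;
}
```

    For the line `  a=a=` this reports key `a` with a two-character value `a=`. An empty value (`b=` followed directly by a newline) is reported with `value_len` 0, which is one way to handle the case mentioned in the note above.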