
Regex for lex file to match printf and scanf statements


I am trying to make a compiler where the lex file matches the following simple variants of printf and scanf statements:

printf("\n Enter your string:");
scanf("%s",str);
scanf("%d",&prelength);

In the scanf examples, str is declared as char str[20] and prelength as int prelength.

The regexes that I currently have in my lex file are the following (for scanf and printf respectively):

scanf\(\"([\w\W]*(%[d|c|f|lf|s])*)+\"(,\s*&?[a-zA-Z]+)*\); 
printf\(\"([\w\W]*(%[d|c|f|lf|s])*)+\"(,\s*[a-zA-Z]+)*\); 

I don't know why these regular expressions don't match the printf and scanf examples given above (similar to those found in C, but simpler).


Solution

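  • Your two scanf lines are actually matched successfully, because their format strings consist of nothing but a conversion specifier, which the (%[d|c|f|lf|s])* part accepts. The printf line isn't matched because the pattern for the string literal does not match its contents. The problem is that lex does not understand \w or \W, so [\w\W] only matches the letters w and W.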

    If lex did support \w and \W, then [\w\W] would match every character that is or isn't a "word character". In other words, it would match everything. So this tells us that instead of [\w\W], you can just write ., which is supported by lex and matches every character except a newline. It also tells us that the (%[d|c|f|lf|s])* bit is redundant, because everything that could be matched by that part would already have been matched by the .* part. Consequently the + quantifier on the outside is also redundant.

    So with that in mind, the regex for string literals would become \".*\" (which doesn't match newlines, but that's okay because C doesn't allow unescaped newlines in string literals). The problem with that is that .* is greedy: it will match everything from the first " on the line to the last " on that line, not just up to the next ". So you want to prohibit "s from appearing within the string. However, a " inside a string is allowed when it is escaped by preceding it with a backslash, and the same goes for newlines. So taking all that into account, a suitable regex for string literals is:

    \"(\\(.|\n)|[^\n\\"])*\"
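
To see the difference between the greedy pattern and the stricter one without running lex, here is a minimal sketch in C using POSIX regcomp/regexec as a stand-in. It is not lex itself: the patterns are translated to POSIX extended syntax (where a backslash inside a bracket expression is literal, and . also matches a newline unless REG_NEWLINE is set), and the names GREEDY, STRICT and match_length are made up for this illustration.

```c
#include <regex.h>
#include <stddef.h>

/* Naive pattern: ".*" is greedy, so it runs to the last quote it can reach. */
static const char GREEDY[] = "^\".*\"";

/* POSIX ERE translation of the lex pattern  \"(\\(.|\n)|[^\n\\"])*\"
   Inside a POSIX bracket expression a backslash is an ordinary character,
   so the class holds a real newline, one backslash and a double quote. */
static const char STRICT[] = "^\"(\\\\(.|\n)|[^\n\\\"])*\"";

/* Hypothetical helper: returns the length of the match of `pattern` at the
   start of `input`, or -1 if it does not match or fails to compile. */
static int match_length(const char *pattern, const char *input)
{
    regex_t re;
    regmatch_t m;
    int len = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, input, 1, &m, 0) == 0 && m.rm_so == 0)
        len = (int)m.rm_eo;
    regfree(&re);
    return len;
}
```

For the input "a" + "b", match_length(GREEDY, ...) returns 9 (the whole text up to the second string's closing quote), while match_length(STRICT, ...) returns 3 (just "a"). An escaped quote inside the literal does not end the match early, and an unterminated literal gives -1.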