Search code examples
regexlex

regular expression to match undetermined number of prefix


In lex, the following strings should be matched and captured

semic
semiconductor
semicondu

using c to do it, the code might be like the following:

strlen(str)>=5 && strncmp(str, "semiconductor", strlen(str))==0;

How can a regular expression or a (lex rule and action) do this?


Solution

  • The regular expression for prefix identification would be:

    semic(o(n(d(u(c(t(or?)?)?)?)?)?)?)?
    

    That's a bit intimidating but it's easy to create programmatically. (See below.)

    It's worth noting that (f)lex rules like the above are really only useful if accompanied by a fallback pattern which recognises all otherwise unmatched "words" (for some definition of "word"). Otherwise, the pattern would simply match the longest matching prefix of whatever word happened to be next in the input stream.

    If you have a lot of tokens which need to be handled in this fashion, the task of creating the above pattern seems a bit intimidating. However, it's very easy to write a preprocessor in some scripting language which can be applied before invoking (f)lex on the scanner definition file.

    Here, for example, is a simple Python program which filters the supplied file (or standard input), replacing words contained between the delimiters {: and :} with the "longest prefix" pattern shown above:

    File: prefixify

    #!/usr/bin/env python3
    import fileinput
    import re
    from sys import stdout
    prefix_pat = re.compile(r'\{:(\w*)(\w):\}')
    def prefixify(m):
        return ( ''.join('('+a for a in m[1])
               + m[2] + '?'
               + ')?' * len(m[1])
               )
    for line in fileinput.input():
        stdout.write(prefix_pat.sub(prefixify, line))
    

    Note: you must chmod a+x prefixify.

    I chose {:...:} because it's extremely unlikely to be used either in a (f)lex pattern or in C/C++ code, so the script doesn't have to make any attempt to verify the context in which the delimiters appear. Of course, "extremely unlikely" is not the same as "impossible". An alternative would be to let the script itself figure out how long the necessary prefix of each keyword should be, perhaps using a combination of the shortest unique prefix for each keyword and some minimum prefix length.

    For example:

    $ cat scanner.l.in
    %option noinput nounput noyywrap nodefault
    %{
    #import "parse.tab.h"
    %}
    
    ws    (#.*|[[:space:]])+
    %%
    {ws}                         /* Ignore whitespace and comments */ ;
    resi{:stor:}                 { return T_RESISTOR; }
    semi                         { return T_SEMI; }
    semic{:onductor:}            { return T_SEMICONDUCTOR; }
    tran{:sformer:}              { return T_TRANSFORMER; }
    trac{:k:}                    { return T_TRACK; }
    [[:alpha:]_][[:alnum:]_]*}   { yylval.str = strdup(yytext); return ID; }
    .                            { return *yytext; }
    
    $ # The expected command line, probably in a makefile would be
    $ # /path/to/prefixify scanner.l.in > scanner.l
    $ # followed by flex -o scanner.c scanner.l
    $ ./prefixify scanner.l.in
    %option noinput nounput noyywrap nodefault
    %{
    #import "parse.tab.h"
    %}
    
    ws    (#.*|[[:space:]])+
    %%
    {ws}                         /* Ignore whitespace and comments */ ;
    resi(s(t(or?)?)?)?                 { return T_RESISTOR; }
    semi                         { return T_SEMI; }
    semic(o(n(d(u(c(t(or?)?)?)?)?)?)?)?            { return T_SEMICONDUCTOR; }
    tran(s(f(o(r(m(er?)?)?)?)?)?)?              { return T_TRANSFORMER; }
    track?                    { return T_TRACK; }
    [[:alpha:]_][[:alnum:]_]*}   { yylval.str = strdup(yytext); return ID; }
    .                            { return *yytext; }