Tags: regex, compiler-construction, flex-lexer, lex, lexical-analysis

Regular expressions - Matching whitespace


I am having trouble writing a regexp that will match all the whitespace in my input.

I have tried \s+ and [ \t\t\r]+, but neither works.

I need this because I am writing a scanner using flex, and I am stuck at matching whitespace. The whitespace should just be matched and not removed.

Example input:

program 
3.3 5 7 
{ comment }
string
panic: cant happen

Solution

    1. flex uses (approximately) the POSIX "Extended Regular Expression" syntax -- \s doesn't work, because it's a Perl extension.

    2. Is [ \t\t\r]+ a typo? I think you'll want a \n in there.

    Something like [ \n\t\r]+ certainly should work. For example, this lexer (which I've saved as lexer.l):

    %{
    
    #include <stdio.h>
    
    %}
    
    %option noyywrap
    
    %%
    
    [ \n\t\r]+  { printf("Whitespace: '%s'\n", yytext); }
    [^ \n\t\r]+ { printf("Non-whitespace: '%s'\n", yytext); }
    
    %%
    
    int main(void)
    {
        yylex();
        return 0;
    }
    

    ...successfully matches the whitespace in your example input (which I've saved as input.txt):

    $ flex lexer.l
    $ gcc -o test lex.yy.c
    $ ./test < input.txt
    Non-whitespace: 'program'
    Whitespace: ' 
    '
    Non-whitespace: '3.3'
    Whitespace: ' '
    Non-whitespace: '5'
    Whitespace: ' '
    Non-whitespace: '7'
    Whitespace: ' 
    '
    Non-whitespace: '{'
    Whitespace: ' '
    Non-whitespace: 'comment'
    Whitespace: ' '
    Non-whitespace: '}'
    Whitespace: '
    '
    Non-whitespace: 'string'
    Whitespace: '
    '
    Non-whitespace: 'panic:'
    Whitespace: ' '
    Non-whitespace: 'cant'
    Whitespace: ' '
    Non-whitespace: 'happen'
    Whitespace: '
    '
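
To make it clearer why the two rules split the input into alternating runs, here is a rough hand-written C sketch of the same idea, not what flex actually generates. The name `scan_runs` and the `WS` constant are made up for this illustration. `strspn` plays the role of `[ \n\t\r]+` and `strcspn` the role of `[^ \n\t\r]+`; each consumes the longest run it can, which is also how flex picks a match. One difference to keep in mind: this sketch works on a NUL-terminated string, so it stops at the first NUL byte, whereas the flex rule `[^ \n\t\r]+` would match NUL bytes in the input.

```c
#include <stdio.h>
#include <string.h>

/* Same character set as the flex class [ \n\t\r]. */
static const char WS[] = " \n\t\r";

/* Hypothetical helper: walks `input` and writes one line per run to `out`,
   in the same format as the lexer's actions. Returns the number of runs. */
size_t scan_runs(const char *input, FILE *out)
{
    size_t count = 0;

    while (*input != '\0') {
        /* Longest prefix made only of whitespace: like [ \n\t\r]+ */
        size_t len = strspn(input, WS);
        const char *kind = "Whitespace";

        if (len == 0) {
            /* Longest prefix containing no whitespace: like [^ \n\t\r]+ */
            len = strcspn(input, WS);
            kind = "Non-whitespace";
        }

        /* %.*s prints exactly `len` characters starting at `input`. */
        fprintf(out, "%s: '%.*s'\n", kind, (int)len, input);
        input += len;
        count++;
    }
    return count;
}
```

Calling `scan_runs("3.3 5 7 \n", stdout)` reports six runs, matching the corresponding lines of the flex output above. In the lexer itself, if the explicit character list feels fragile, flex also accepts the POSIX class `[[:space:]]+`, which additionally matches vertical tab and form feed.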