
Perl multiline regex for first 3 individual items


I am trying to parse formatted output with a regex in Perl. Usually each record is on a single line, but sometimes the same record is spread over 3 lines.

For the single-line format below, I can use the regex

/^\s*(.*)\s+([a-zA-Z0-9._]+)\s+(\d+)\s+(.*)/

to get the first 3 individual items in line

Hi There       FirstName.LastName    10  3/23/2011 2:46 PM

Below is the multi-line format I see. I am trying to use something like

/^\s*(.*)\n*\n*|\s+([a-zA-Z0-9._]+)\s+(\d+)\s+(.*)$/m

to get the individual items, but it doesn't seem to work.

Hi There    

                         FirstName-LastName       8       7/17/2015 1:15 PM 

Testing - 12323232323 Hello There

Any suggestions? Is multi-line regex possible?

NOTE: In the same output I can see single-line records, multi-line records, or both, so the output can look like this:

Hello Line1 FirstName.LastName 10 3/23/2011 2:46 PM

Hello Line2

                         Line2FirstName-LastName       8       7/17/2015 1:15 PM 

Testing - 12323232323 Hello There

Hello Line3 Line3FirstName.LastName 8 3/21/2011 2:46 PM


Solution

  • You can certainly apply a regex across multiple lines.

    I've used the negated word class \W+ between words to match the spaces and newlines between them (\W is equivalent to [^a-zA-Z0-9_] for ASCII text). The chat is treated as a repeated \w+\W+ block.

    If you provide more specific input/output cases, I can refine the example code:

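    #!/usr/bin/env perl
    use strict;
    use warnings;
    
    # Sample multi-line record from the question (one space before "PM",
    # because the pattern below allows at most one with \s?)
    my $input = <<'__END__';
    Hi There    
    
                             FirstName-LastName       8       7/17/2015 1:15 PM 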
    
    Testing - 12323232323 Hello There
    __END__
    
    my ($chat,$username,$chars,$timestamp) = $input =~ m/(?im)^\s*((?:\w+\W+)+)(\w+[-,\.]\w+)\W+(\d+)\W+([0-1]?\d\/[0-3]?\d\/[1-2]\d{3}\s+[0-2]?\d:[0-5]?\d\s?[ap]m)/;
    
    $chat =~ s/\s+$//;  #remove trailing spaces
    
    print "chat -> ${chat}\n";
    print "username -> ${username}\n";
    print "chars -> ${chars}\n";
    print "timestamp -> ${timestamp}\n";
    

    Legend

    • m/^.../: a match (not a substitution), anchored at the start of a line
    • (?im): case-insensitive and multiline matching (with m, ^ and $ also match at the start and end of each line)
    • \s*: zero or more whitespace characters (spaces, tabs, line breaks or form feeds)
    • ((?:\w+\W+)+): (match group $chat) one or more repetitions of a single word \w+ (letters, digits, '_') followed by non-word characters \W+ (anything that is not \w, including the newline \n). Trailing whitespace is stripped from this group afterwards
    • (\w+[-,\.]\w+): (match group $username) this is the weak point. If the username is not made of two regex words separated by a dash '-', a comma ',' or (UPDATE) a dot '.', the whole regex fails. I took the dash and dot forms from the examples in your question; the format is not stated explicitly.
    • (\d+): (match group $chars) a number made of one or more digits
    • ([0-1]?\d\/[0-3]?\d\/[1-2]\d{3}\s+[0-2]?\d:[0-5]?\d\s?[ap]m): (match group $timestamp) this one is longer than the others, so split it up:
      • [0-1]?\d\/[0-3]?\d\/[1-2]\d{3} matches a date made of a month (with an optional leading zero), a day (with an optional leading zero) and a year from 1000 to 2999 (a relaxed constraint :)
      • [0-2]?\d:[0-5]?\d\s?[ap]m matches the time: hour:minutes, an optional space and 'pm', 'PM', 'am', 'AM', 'Am', 'Pm'... thanks to the case-insensitive modifier above

    You can test it online here
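
For the mixed output in the question's NOTE, note that the greedy ((?:\w+\W+)+) group above would run on to the last record when several are present. The following is only a sketch of one possible variation, not the pattern from the answer above. It assumes that the chat text always fits on one line, that usernames contain only word characters, dots and dashes, and that the "Testing - ..." line of a multi-line record can be skipped. The names $record and @records are made up for illustration, and \h needs Perl 5.10 or later.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical sample mixing single-line and three-line records,
# modelled on the output shown in the question.
my $input = <<'__END__';
Hello Line1 FirstName.LastName 10 3/23/2011 2:46 PM

Hello Line2

                         Line2FirstName-LastName       8       7/17/2015 1:15 PM

Testing - 12323232323 Hello There

Hello Line3 Line3FirstName.LastName 8 3/21/2011 2:46 PM
__END__

# /x allows whitespace and comments, /m lets ^ match at every line start,
# /i accepts "pm" as well as "PM".
my $record = qr{
    ^ \h* (\S [^\n]*?)      # chat: lazy, confined to a single line
    \s+   ([\w.-]+)         # username: may sit on a later line
    \s+   (\d+)             # count
    \s+   (\d{1,2}/\d{1,2}/\d{4} \s+ \d{1,2}:\d{2} \s* [ap]m)   # timestamp
}xmi;

# /g in a while loop walks through the input one record at a time.
my @records;
while ($input =~ /$record/g) {
    push @records, { chat => $1, username => $2, count => $3, timestamp => $4 };
}

printf "%s | %s | %s | %s\n", @{$_}{qw(chat username count timestamp)}
    for @records;
```

Because the chat group cannot cross a newline while the \s+ after it can, the same pattern covers both layouts. The "Testing - ..." line never matches, since nothing after it looks like a count followed by a timestamp, so on this sample the loop should collect three records.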