regex perl multiline

Matching multiple lines of poorly formatted text in Perl


I receive data in the format below from an external program and need to extract the first four fields (text, username, number, and timestamp) of each record. Note that "Hello Line1" is a single field and the username is the second field. A record may arrive on a single line like Line1 below, spread over three lines like Line2, or over two lines like Line4. The layouts can also be mixed within the same output, as shown below (records are not always single-line, always double-line, and so on).

Hello Line1 FirstName.LastName 10 3/23/2011 2:46 PM

Hello Line2

                         Line2FirstName-LastName       8       7/17/2015 1:15 PM 

Line2Testing - 12323232323 Hello There

Hello Line3 Line3FirstName.LastName 8 3/21/2011 2:46 PM

Hello Line4

                         Line4FirstName-LastName       8       9/17/2015 1:20 PM


I was able to get a multiline regex working with the help of this question: Perl multiline regex for first 3 individual items

Thanks to @GsusRecovery!

Since I am reading the output line by line, I don't think I can take advantage of the multiline regex while holding only a single line at a time. Is it possible to read just one line when the record fits on one line, two lines when it is spread over two, and three lines when it is spread over three?

Or is it better to read every line and backtrack depending on whether the record uses the double-line or triple-line format?

Please suggest.


Solution

  • UPDATE: I've changed the script to read stdin into the @output_lines array (to emulate the input situation of @sureng).

    I've wrapped the regex in a line accumulator that recognizes the time of day (for example 2:46 PM) as the closing pattern of a record. This way you can process the output line by line and still apply the multiline regex.

    #!/usr/bin/perl

    use strict;
    use warnings;

    my $accumulator = '';

    # Read stdin into an array to emulate the lines returned by the external program
    my @output_lines = <STDIN>;

    foreach my $line (@output_lines)
    {
        $accumulator .= $line;

        # A time such as "2:46 PM" closes a record; keep accumulating until it shows up
        next unless $accumulator =~ /(?i)([0-2]?\d:[0-5]?\d\s?[ap]m)/;

        # \s also matches newlines, so the four fields may span several lines
        my ($chat,$username,$chars,$timestamp) = $accumulator =~ m/(?im)^\s*(.+)\s+(\w+[-,\.]\w+)\s+(\d+)\s+([0-1]?\d\/[0-3]?\d\/[1-2]\d{3}\s+[0-2]?\d:[0-5]?\d\s?[ap]m)\s*$/;

        # Print only when the full pattern matched, to avoid warnings on undefined fields
        if ( defined $timestamp ) {
            $chat =~ s/\s+$//;  # remove trailing spaces
            print "SECTION matched\n";
            print "-"x80,"\n";
            print "$accumulator";
            print "-"x80,"\n";
            print "chat -> ${chat}\n";
            print "username -> ${username}\n";
            print "chars -> ${chars}\n";
            print "timestamp -> ${timestamp}\n\n";
        }
        $accumulator = '';  # reset the line accumulator
    }
    

    Try the solution online (with your example provided as stdin) here.

    In your shell, given the script above and this input file:

    # MultiLineInput.txt
    Hello Line1 FirstName.LastName 10 3/23/2011 2:46 PM
    
    Hello Line2
    
                         Line2FirstName-LastName       8       7/17/2015 1:15 PM 
    Line2Testing - 12323232323 Hello There
    
    Hello Line3 Line3FirstName.LastName 8 3/21/2011 2:46 PM
    
    Hello Line4
    
                         Line4FirstName-LastName       8       9/17/2015 1:20 PM
    

    You can simply call:

    cat MultiLineInput.txt | perl StreamRegex.pl
    

    If it works as expected, you can replace the cat command with your actual source.

    NB: this approach is needed if you process a stream, or if your file is larger than the system's available memory (so that you have to process it as a stream). That said, it works in any case.
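
To make the streaming point concrete, here is a minimal sketch of the same accumulator that reads one line at a time instead of loading everything into an array first. The helper name parse_stream and its callback argument are assumptions made for illustration and are not part of the script above. The sample data is fed through an in-memory filehandle so the example is self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Full record: chat text, username, number, timestamp (same pattern as the script above).
my $record_re  = qr/^\s*(.+)\s+(\w+[-,.]\w+)\s+(\d+)\s+([0-1]?\d\/[0-3]?\d\/[1-2]\d{3}\s+[0-2]?\d:[0-5]?\d\s?[ap]m)\s*$/im;
# A time of day such as "2:46 PM" marks the end of a record.
my $closing_re = qr/[0-2]?\d:[0-5]?\d\s?[ap]m/i;

# parse_stream is a hypothetical helper: it reads $fh line by line and
# calls $on_record with the four fields of every complete record.
sub parse_stream {
    my ($fh, $on_record) = @_;
    my $accumulator = '';
    while (my $line = <$fh>) {
        $accumulator .= $line;
        next unless $accumulator =~ $closing_re;  # record not finished yet
        if (my @fields = $accumulator =~ $record_re) {
            $fields[0] =~ s/\s+$//;               # trim trailing spaces from the chat text
            $on_record->(@fields);
        }
        $accumulator = '';                        # start collecting the next record
    }
    return;
}

# Sample input taken from the question, fed through an in-memory filehandle.
my $sample = join "\n",
    'Hello Line1 FirstName.LastName 10 3/23/2011 2:46 PM',
    '',
    'Hello Line2',
    '',
    '                         Line2FirstName-LastName       8       7/17/2015 1:15 PM',
    'Line2Testing - 12323232323 Hello There',
    '',
    'Hello Line3 Line3FirstName.LastName 8 3/21/2011 2:46 PM',
    '';
open(my $fh, '<', \$sample) or die "cannot open in-memory handle: $!";

my @records;
parse_stream($fh, sub { push @records, [@_] });
close($fh);

print join(' | ', @$_), "\n" for @records;
```

To parse real output, pass \*STDIN (or a handle opened on your source) instead of the in-memory handle. Only the lines of the record currently being assembled are held in memory.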