Tags: parsing, text, raku, flip-flop

How can I extract some data out of the middle of a noisy file using Perl 6?


I would like to do this using idiomatic Perl 6.

I found a wonderful contiguous chunk of data buried in a noisy output file.

I would like to simply print out the header line starting with Cluster Unique and all of the lines following it, up to, but not including, the first occurrence of an empty line. Here's what the file looks like:

</path/to/projects/projectname/ParameterSweep/1000.1.7.dir> was used as the working directory.
....

Cluster Unique Sequences    Reads   RPM
1   31  3539    3539
2   25  2797    2797
3   17  1679    1679
4   21  1636    1636
5   14  1568    1568
6   13  1548    1548
7   7   1439    1439

Input file: "../../filename.count.fa"
...

Here's what I want parsed out:

Cluster Unique Sequences    Reads   RPM
1   31  3539    3539
2   25  2797    2797
3   17  1679    1679
4   21  1636    1636
5   14  1568    1568
6   13  1548    1548
7   7   1439    1439

Solution

  • I would like to do this using idiomatic Perl 6.

    In Perl, the idiomatic way to locate a chunk in a file is to read the file in paragraph mode and stop reading as soon as you find the chunk you are interested in. If you are reading a 10 GB file and the chunk is near the top, it is wasteful to keep reading the rest of the file, and more wasteful still to run an if test on every remaining line.

    In Perl 6, you can read a paragraph at a time like this:

    my $fname = 'data.txt';
    
    my $infile = open(
        $fname, 
        nl => "\n\n",   #Set what perl considers the end of a line.
    );  #Removed die() per Brad Gilbert's comment. 
    
    for $infile.lines() -> $para {  
        if $para ~~ /^ 'Cluster Unique'/ {
            say $para.chomp;
            last;   #Quit reading the file.
        }
    }
    
    $infile.close;
    
    #   ^                    Match start of string.
    #   'Cluster Unique'     By default, whitespace is insignificant in a Perl 6 regex. Quoting is one way to make it significant.
    

    However, in Perl 6 (Rakudo on MoarVM), the open() function currently does not honor the nl argument, so you can't set paragraph mode this way.
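Without paragraph mode, one workaround is to read the file line by line and stop at the first empty line after the header. The following is a minimal sketch rather than a definitive solution: the sub name extract-chunk and its $header parameter are hypothetical names chosen for illustration, and it assumes the chunk ends at a truly empty line (no stray whitespace).

```raku
# Hypothetical helper: collect the header line and everything after it,
# up to (but not including) the first empty line.
sub extract-chunk(@lines, Str $header = 'Cluster Unique') {
    my @chunk;
    my $in-chunk = False;

    for @lines -> $line {
        if $in-chunk {
            if $line eq '' {
                last;               # End of the chunk: stop reading.
            }
            @chunk.push($line);
        }
        elsif $line.starts-with($header) {
            $in-chunk = True;       # Found the header line.
            @chunk.push($line);
        }
    }

    return @chunk;
}
```

Because .lines on a file path is lazy, calling extract-chunk('data.txt'.IO.lines) should stop pulling lines from the file once the loop hits last, so the rest of a large file is not read. The result can then be printed with an explicit loop variable, e.g. for extract-chunk('data.txt'.IO.lines) -> $line { say $line }.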

    Also, there are certain idioms that are considered by some to be bad practice, like:

    1. Postfix if statements, e.g. say 'hello' if $y == 0.

    2. Relying on the implicit $_ variable in your code, e.g. .say

    So, depending on which side of the fence you are on, code that relies on these idioms may be considered bad practice in Perl.