Perl: match regex from the file


I have a tab-delimited file that contains information about itemsets. Each itemset consists of one to three items:

MTMR14_Q1   NOTCH1_Q3   PRKCD_Q1        
MTMR14_Q1   NOTCH1_Q3   TFRC_Q3     
MTMR14_Q1   NOTCH1_Q3           
MTMR14_Q1           
MTMR14_Q1   PASD1_Q3

My goal is to retrieve itemsets with three items only:

MTMR14_Q1   NOTCH1_Q3   PRKCD_Q1        
MTMR14_Q1   NOTCH1_Q3   TFRC_Q3 

I wrote the following code, but it does not retrieve any itemsets:

#!/usr/bin/perl -w

use strict;

my $input = shift @ARGV or die $!; 

open (FILE, "$input") or die $!;

while (<FILE>) {
    my $seq = $_;
    chomp $seq;
        
    if ($seq =~ /[A-Z]\t[A-Z]\t[A-Z]/) {  
        #using the binding operator to match a string to a regular expression
    
        print $seq . "\n";
    }
}

close FILE;

Could you please pinpoint my error?


Solution

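  • [A-Z] matches a single uppercase letter, so /[A-Z]\t[A-Z]\t[A-Z]/ only matches a line in which two tabs are separated by exactly one uppercase letter. None of your item names (e.g. NOTCH1_Q3) is a single letter, so no line matches.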


    Skip lines that don't contain exactly 3 fields:

    next if $seq !~ /^ [^\t]* \t [^\t]* \t [^\t]* \z/x;
    

    [^\t]* matches any number of non-tab characters.


    Skip lines that don't contain exactly 3 non-empty fields:

    next if $seq !~ /^ [^\t]+ \t [^\t]+ \t [^\t]+ \z/x;
    

    [^\t]+ matches one or more non-tab characters.
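
To see how the two patterns differ, here is a minimal sketch that runs both against a few lines. The helper names has_three_fields and has_three_nonempty_fields and the sample strings are made up for illustration; they are not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helpers wrapping the two patterns discussed above.
sub has_three_fields {
    my ($seq) = @_;
    return $seq =~ /^ [^\t]* \t [^\t]* \t [^\t]* \z/x ? 1 : 0;
}

sub has_three_nonempty_fields {
    my ($seq) = @_;
    return $seq =~ /^ [^\t]+ \t [^\t]+ \t [^\t]+ \z/x ? 1 : 0;
}

# Made-up sample lines (already chomped).
my @samples = (
    "MTMR14_Q1\tNOTCH1_Q3\tPRKCD_Q1",   # three non-empty fields
    "MTMR14_Q1\tNOTCH1_Q3\t",           # third field is empty
    "MTMR14_Q1\tPASD1_Q3",              # only two fields
);

for my $seq (@samples) {
    (my $shown = $seq) =~ s/\t/\\t/g;   # make tabs visible in the output
    printf "%-40s fields=%d nonempty=%d\n",
        $shown, has_three_fields($seq), has_three_nonempty_fields($seq);
}
```

Only the second sample should separate the two checks: it has three fields, but the last one is empty.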


    Presumably, you'll follow up by parsing each line to get the three fields. If so, you could split first and check afterwards:

    my @fields = split /\t/, $seq, -1;
    
    next if @fields != 3;                    # Require exactly 3 fields.
    
    next if ( grep length, @fields ) != 3;   # Require exactly 3 non-empty fields.
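
Put together, the split-based check might look like the sketch below. It assumes the file name is passed as the first command-line argument, as in the question. The helper name is_three_item_set is hypothetical, and the lexical filehandle with three-argument open is a stylistic substitution for the bareword FILE handle in the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the (chomped) line holds exactly
# three non-empty tab-separated fields, 0 otherwise.
sub is_three_item_set {
    my ($seq) = @_;
    my @fields = split /\t/, $seq, -1;           # -1 keeps trailing empty fields
    return 0 if @fields != 3;                    # Require exactly 3 fields.
    return 0 if ( grep length, @fields ) != 3;   # Require all 3 to be non-empty.
    return 1;
}

# Act as a filter only when a file name is given on the command line.
if (@ARGV) {
    my $input = shift @ARGV;
    open( my $fh, '<', $input ) or die "Cannot open '$input': $!";
    while ( my $seq = <$fh> ) {
        chomp $seq;
        print "$seq\n" if is_three_item_set($seq);
    }
    close $fh;
}
```

If the real file pads shorter itemsets with trailing tabs, those lines split into three fields with empty entries and are rejected by the non-empty check. Lines with more than two tabs are rejected by the field count, so it is worth checking how the data is actually padded.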