Definition error when using columns in one file to find matching columns in another file with perl


I have a tab delimited input file in the format:

+    Chr1    www
-    Chr2    zzz
...

I would like to go line by line against a reference tab delimited file (TRANSCRIPTS in the code below) in the format of:

Chr1    +    xxx    UsefulInfo1
Chr2    -    yyy    UsefulInfo2
...

And would like an output that looks like:

+    Chr1    UsefulInfo1
-    Chr2    UsefulInfo2
...

Here is my attempt to take the file names from the command line, grab certain fields from the input file, and append the useful info from the reference file:

#!/usr/bin/perl

use strict;
use warnings;
use diagnostics;

my $inFile = $ARGV[0];
my $outFile = $ARGV[1];

open(INFILE, "<$inFile") || die("Couldn't open $inFile: $!\n");
open(OUTFILE, ">$outFile") || die("Couldn't create $outFile: $!\n");

open(TRANSCRIPTS, "</path/TranscriptInfo") || die("Couldn't open reference file!");
my @transcripts = split(/\t+/, <TRANSCRIPTS>);
chomp @transcripts;

#Define desired information from input for later
while (my @columns = split(/\t+/, <INFILE>)) {
    chomp @columns;
    my $strand = $columns[0];
    my $chromosome = $columns[1];

    #Attempt to search reference file line by line for matching criteria and copying a column of matching lines
    foreach my $reference(@transcripts) {
        my $refChr = $reference[0]; #Error for this line
        my $refStrand = $reference[1]; #Error for this line
        if ($refChr eq $chromosome && $refStrand eq $strand) {
            my $info = $reference[3]; #Error for this line
            print OUTFILE "$strand\t$chromosome\t\$info\n";
        }
    }
}
    
close(OUTFILE); close(INFILE);

At the moment I receive "Global symbol "@reference" requires explicit package name." What is the proper way to define this? I'm not even sure my foreach loop will work as intended once the symbol is defined properly.


Solution

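  • Fixed. In the original, $reference is a scalar, but $reference[0] refers to element 0 of an array named @reference, which was never declared; hence the error under strict. Also, split evaluates its second argument in scalar context, so split(/\t+/, <TRANSCRIPTS>) reads only the first line of the reference file. The version below stores each reference line as an array reference and accesses its fields with $transcript->[N]: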

    use strict;
    use warnings;
    use feature qw( say );
    
    my $in_qfn          = $ARGV[0];
    my $out_qfn         = $ARGV[1];
    my $transcripts_qfn = "/path/TranscriptInfo";
    
    my @transcripts;
    {
       open(my $transcripts_fh, "<", $transcripts_qfn)
          or die("Can't open \"$transcripts_qfn\": $!\n");
       while (<$transcripts_fh>) {
          chomp;
          push @transcripts, [ split(/\t/, $_, -1) ];
       }    
    }
    
    {
       open(my $in_fh, "<", $in_qfn)
          or die("Can't open \"$in_qfn\": $!\n");
       open(my $out_fh, ">", $out_qfn)
          or die("Can't create \"$out_qfn\": $!\n");
       while (<$in_fh>) {
          chomp;
          my ($strand, $chr) = split(/\t/, $_, -1);
          for my $transcript (@transcripts) {
             my $ref_chr    = $transcript->[0];
             my $ref_strand = $transcript->[1];
             if ($chr eq $ref_chr && $strand eq $ref_strand) {
                my $info = $transcript->[3];  # fourth column holds the useful info
                say $out_fh join("\t", $strand, $chr, $info);
             }
          }
       }
    }
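
To see why the arrow matters, here is a minimal sketch that uses hypothetical in-memory rows in place of the reference file. The helper name format_row is made up for illustration and is not part of the code above.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the answer above: builds one output
# line (strand, chromosome, info) from a reference row.
sub format_row {
    my ($transcript) = @_;
    # $transcript is a scalar holding an array reference, so its elements
    # are reached with the arrow. $transcript[0] would instead mean
    # element 0 of a separate array @transcript, which is not declared.
    return join("\t", $transcript->[1], $transcript->[0], $transcript->[3]);
}

# Each row is an anonymous array of the tab-separated fields of one line.
my @transcripts = (
    [ split(/\t/, "Chr1\t+\txxx\tUsefulInfo1", -1) ],
    [ split(/\t/, "Chr2\t-\tyyy\tUsefulInfo2", -1) ],
);

print format_row($_), "\n" for @transcripts;
```

Replacing $transcript->[1] with $transcript[1] in this sketch should reproduce the "Global symbol ... requires explicit package name" error from the question.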
    

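    That said, the above is very inefficient. Let's call N the number of lines in $transcripts_qfn, and let's call M the number of lines in $in_qfn. The body of the inner loop executes N*M times. With a hash lookup, each file only needs to be read once: M iterations to record the strand/chromosome pairs from the input, then N iterations over the reference file.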

    use strict;
    use warnings;
    use feature qw( say );
    
    my $in_qfn          = $ARGV[0];
    my $out_qfn         = $ARGV[1];
    my $transcripts_qfn = "/path/TranscriptInfo";
    
    my %to_print;
    {
       open(my $in_fh, "<", $in_qfn)
          or die("Can't open \"$in_qfn\": $!\n");
       while (<$in_fh>) {
          chomp;
          my ($strand, $chr) = split(/\t/, $_, -1);
          ++$to_print{$strand}{$chr};
       }    
    }
    
    {
       open(my $transcript_fh, "<", $transcripts_qfn)
          or die("Can't open \"$transcripts_qfn\": $!\n");
       open(my $out_fh, ">", $out_qfn)
          or die("Can't create \"$out_qfn\": $!\n");
       while (<$transcript_fh>) {
          chomp;
          my ($ref_chr, $ref_strand, undef, $info) = split(/\t/, $_, -1);  # skip the third column
          next if !$to_print{$ref_strand};
          next if !$to_print{$ref_strand}{$ref_chr};
          say $out_fh join("\t", $ref_strand, $ref_chr, $info);
       }
    }
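
Note that a version that scans the reference file in its outer loop prints matches in reference-file order, and prints each matching reference line once even if the input repeats a strand/chromosome pair. If the output should follow the input file instead, one option is the reverse arrangement: index the reference file by chromosome and strand, then stream the input. The following is only a sketch under that assumption. It works on in-memory lines rather than file handles, and the names build_index and annotate are hypothetical.

```perl
use strict;
use warnings;

# Build a lookup of the form $index{$chr}{$strand} = [ info, ... ]
# from the lines of the reference file.
sub build_index {
    my @ref_lines = @_;
    my %index;
    for my $line (@ref_lines) {
        chomp $line;
        # The third column is not needed, so it is discarded with undef.
        my ($chr, $strand, undef, $info) = split(/\t/, $line, -1);
        push @{ $index{$chr}{$strand} }, $info;
    }
    return \%index;
}

# For each input line, emit one output line per matching reference entry,
# keeping the order (and any repeats) of the input.
sub annotate {
    my ($index, @in_lines) = @_;
    my @out;
    for my $line (@in_lines) {
        chomp $line;
        my ($strand, $chr) = split(/\t/, $line, -1);
        # The exists checks avoid autovivifying empty entries in the index.
        next if !exists $index->{$chr} || !exists $index->{$chr}{$strand};
        push @out, join("\t", $strand, $chr, $_) for @{ $index->{$chr}{$strand} };
    }
    return @out;
}

my $index = build_index("Chr1\t+\txxx\tUsefulInfo1\n", "Chr2\t-\tyyy\tUsefulInfo2\n");
print "$_\n" for annotate($index, "-\tChr2\tzzz\n", "+\tChr1\twww\n");
```

This keeps the total work at roughly N+M iterations, but holds the reference file in memory instead of the input, so which arrangement is preferable depends on which file is larger and which output order is wanted.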