perl bioinformatics bioperl fastq sequencing

Processing FASTQ files based on mate pair length

The following files are two mates of a paired-end fastq file, I want to separate each fastq based on their length.

mate1.fq:

@SRR127.1
TGGTTATGATGTTTGTGTAGGAATAGAAATTTTGATTAAGATATTAGTGAAATTTGAATGTAGTTTATTTGGAAGTTATGGAGAGTTTATATTGTATTTATGTTTATTGTTGTAGATTTATATTTATGTGTATATATTAGTTTTTTTGTGT
+
ABAAAF4FFFFFGGGGGGFFGGFGHGFGHHHHHGGCFFGHHHHH5FDBED55DGGFEGFHHHGBHDDHHHFF3AB3FFG5CBGBEF5BD5DGFEGHFAGAFEDGHGFHHGHGEFFGFGGHFEGHHFHGBEBGHHHHGHBHHFHHGGFGHH2
@SRR127.2
TATGGTAAGAAAATTGAAAATTATAAAAAATGAAAAATGTTTATTTGATGATTTGAAAAATGATGAAATTATTGAAAAATGTGAAAAATGAGAAATGTATATTGTAGGATTTGGAATATGGTGAGATAAATGAAAATTATAGTAAATG
+
AABAA5@D4@5CFFCA55FFGGHDGFHFFCC45DGFA2FA5DD55AAAA55DDBDEDDBGGFF5BA5DDABF5D5B5FF1ADFB5EDGHFG5@BFBD55D5FFB@@5@GBGEFBGHHGB@DBBFHFBDG3B43FFH@FGFHH?FHHHH

mate2.fq:

@SRR127.1
ACCTATAAAAAAACCATATCAATAACTATAAAATCTTTATAAAATCCCACCCAATTAAAAAAAAATAAATTAATACATATAAAACCTTAAACACATAAAACATAATCACATACTATATAAACAATTACTATCACTACTAAACACCTAATA
+
>AA?AF13B@D@1EFCGGGFFG3EBGHHHBB2FGHHGHGFDGHHDFEGFHGGGHG1FFF1GGCGGGBGHHHHHFHHHHFHEGGFHF0BD1FGHHAGEGHFHHHFGGFHHGHHHFHHGGFHBGHFED1FBGFGFHDGHGHFGG1GB0GFHH
@SRR127.2
CTATTTCTCATTTTTTTATAATTTTCAATTCTCTTACCATATTCCACATCCTACACTAAACATTTCTAAATTTTCCACCTTTTTCTATTTTTCTCACCATATTTCATATCCTAAAAAACATATTCCTCATTTACTATAATTTTCAATTATC
+
11>>AFFDFF3@FFF?EFFGFBGHFDFA33D2FF2GGHFE12DD221AF1F1E1BG1GGBFBGGEGHDAABGAGDFABGG1BBDF12A2@2BG@2@DEFFF2B2@2222BB2211FGEE/11@22B2>1B22F2>GBGBD22BGD2>2B22

I wrote the following code to do this but I get a strange error only for the second file (mate2.fq) while both of them also have 151 bp reads.

#!/usr/bin/perl

use strict;
use warnings;

my @fh;

my $file_name = $ARGV[0];
my $infile    = $ARGV[1];

#convert every 4-line fastq to 1-line
open(FH, "cat '$infile' | awk '{printf \"%s%s\",\$0,(NR%4?FS:RS)}' | ");

while (<FH>) {
  chomp;

  my @line = split(/\s+/, $_);
  my $len  = length($line[1]);

  if ($len >= 100) {

    #print $len,"\n",$_,"\n";
    push @fh, $len;

    if (not defined $fh[$len]) {
      open $fh[$len], '>', "$file_name\_$len";
    }
    print { $fh[$len] } (join("\n", @line), "\n");
  }

}

Error:

Can't use string ("151") as a symbol ref while "strict refs" in use at

How can I process these files?

Solution

As you have read, your problem is because of a spurious push that adds an integer value to the end of the @fh array. I presume you were aiming to extend the array to be long enough to add the new file handle. You can do that by assigning to $#fh, so you would write $#fh = $len if $#fh < $len; however it is unnecessary because Perl will extend arrays automatically for you when you simply assign to an element off the end of the array

I have a couple of comments on your program that I hope you find useful

It is unnecessary and wasteful to shell out to an awk command. Perl is quite capable of doing all that awk can do
If you find yourself writing split /\s+/, $_ then you almost certainly mean just split: the default behaviour is to do split ' ', $_. If you use /\s+/ as the pattern and there happens to be leading whitespace on the string you are splitting, then split will return an empty string as the first item in the list of fields. If you use ' ' instead (a literal single space, not the pattern / /) then this won't happen. In effect, split ' ' is equivalent to /\S+/g
When interpolating variable values within a string it's generally neater to put identifiers inside braces if there is a following character that could be part of the identifer. So "${file_name}_$len" instead of "$file_name\_$len"

This is how I would write your code. It accumulates the input records into $line until four records have been added, and then processes that line as before.

#!/usr/bin/perl

use strict;
use warnings;

my ($file_name, $infile) = @ARGV;

open my $in_fh, '<', $infile or die $!;
my $line;

my @fh;
while ( <$in_fh> ) {
  chomp;
  $line .= $_;

  if ( $. % 4 == 0 or eof ) {

    my @line = split ' ', $line;
    my $len  = length $line[1];
    next if $len < 100;

    open $fh[$len], '>', "${file_name}_$len" unless $fh[$len];
    print { $fh[$len] } "$_\n" for @line;

    $line = undef;
  }
}