Search code examples
perlbioinformaticsfastafastqsequencing

eliminate empty files in a subroutine in perl


I want to a add a code in the next script to eliminate those empty output files.

The script convert a single fastq file or all the fastq files in a folder to fasta format, all the output fasta files keep the same name of the fastq file; the script present an option to exclude all the sequences that present a determinate number of NNN repeats (NNNNNNNNNNNNNNNNNNATAGTGAAGAATGCGACGTACAGGATCATCTA), I added this option because some sequences present only NNNNN in the sequences, example: if the -n option is equal to 15 (-n 15) it will exclude all the sequences that present 15 o more N repeats, to this point the code works well, but it generate an empty files (in those fastq files that all the sequences present 15 or more N repeats are excluded). I want to eliminate all the empty files (without sequences) and add a count of how many files were eliminate because it were empty.

Code:

#!/usr/bin/env perl
use strict;
use warnings;
use Getopt::Long;

my ($infile, $file_name, $file_format, $N_repeat, $help, $help_descp,
    $options, $options_descrp, $nofile, $new_file, $count);

my $fastq_extension = "\\.fastq";

GetOptions (
    'in=s'      => \$infile,
    'N|n=i'     =>\$N_repeat,
    'h|help'    =>\$help,
    'op'        =>\$options
);

 # Help

 $help_descp =(qq(              
              Ussaje:
              fastQF -in fastq_folder/ -n 15
                      or
              fastQF -in file.fastq -n 15
              ));

 $options_descrp =(qq(

            -in      infile.fastq or fastq_folder/                  required
            -n       exclude sequences with more than N repeat      optional
            -h       Help description                               optional
            -op      option section                                 optional
                   ));

 $nofile =(qq(
            ERROR:  "No File or Folder Were Chosen !"

                Usage:
                    fastQF -in folder/

                Or See -help or -op section
           ));

 # Check Files 

    if ($help){
        print "$help_descp\n";
        exit;
    }
    elsif ($options){
        print "$options_descrp\n";
        exit;
    }

    elsif (!$infile){
        print "$nofile\n";
        exit;
    }


 #Subroutine to convert from fastq to fasta

    sub fastq_fasta {

        my $file = shift;
        ($file_name = $file) =~ s/(.*)$fastq_extension.*/$1/;

# eliminate old files 

        my $oldfiles= $file_name.".fasta";

        if ($oldfiles){
            unlink $oldfiles;
        }

        open LINE,    '<',   $file             or die "can't read or open $file\n";
        open OUTFILE, '>>', "$file_name.fasta" or die "can't write $file_name\n";

        while (
            defined(my $head    = <LINE>)       &&
            defined(my $seq     = <LINE>)       &&
            defined(my $qhead   = <LINE>)       &&
            defined(my $quality = <LINE>)
        ) {
                substr($head, 0, 1, '>');


                if (!$N_repeat){
                    print OUTFILE $head, $seq;


                }

                elsif ($N_repeat){

                        my $number_n=$N_repeat-1;

                    if ($seq=~ m/(n)\1{$number_n}/ig){
                        next;
                    }
                    else{
                        print OUTFILE $head, $seq;
                    }
                }
        }

        close OUTFILE;
        close LINE;
    }

 # execute the subrutine to extract the sequences

    if (-f $infile) {           # -f es para folder !!
        fastq_fasta($infile);
    }
    else {
        foreach my $file (glob("$infile/*.fastq")) {
        fastq_fasta($file);
        }
    }

 exit;

I have tried to use the next code outside of the subroutine (before exit) but it just work for the last file :

$new_file =$file_name.".fasta";
        foreach ($new_file){

            if (-z $new_file){
                $count++;
                if ($count==1){
                    print "\n\"The choosen File present not sequences\"\n";
                    print " \"or was excluded due to -n $N_repeat\"\n\n";

                }
                elsif ($count >=1){
                    print "\n\"$count Files present not sequences\"\n";
                    print " \" or were excluded due to -n $N_repeat\"\n\n";

                }

                unlink $new_file;
            }
        }

and I just have tried something similar inside of the subroutine but this last code don´t work !!!!

Any Advise !!!!???

Thanks So Much !!!


Solution

  • you should check, if something was written to your new file at the end of our fastq_fasta subroutine. Just put your code after the close OUTFILE statement:

    close OUTFILE;
    close LINE;
    
    my $outfile = $file_name.".fasta";
    if (-z $outfile)
    {
       unlink $outfile || die "Error while deleting '$outfile': $!";
    }
    

    Additionally, it will be better to add the die/warn statement also to the other unlink line. Empty files should be deleted.

    Maybe another solution if you are not fixed to perl, but allowed to use sed and a bash loop:

    for i in *.fastq
    do
       out=$(dirname "$i")/$(basename "$i" .fastq).fasta
       sed -n '1~4{s/^@/>/;N;p}' "$i" > "$out"
       if [ -z $out ]
       then
          echo "Empty output file $out"
          rm "$out"
       fi
    done
    

    Hope that helps!

    Best Frank