
Perl: check whether a sequence contains at least 3 unique bases and delete it if not


I have a FASTA file. I need to remove sequences that contain "N" or that do not contain at least 3 unique bases. My code so far is below. Also, how would I remove the sequence ID line as I go along for the sequences I delete?

#!/usr/bin/perl
use strict;
use warnings;

open FILE, '<', $ARGV[0] or die qq{Failed to open "$ARGV[0]" for input: $!\n};
open match_fh, ">$ARGV[0]_trimmed.fasta"
    or die qq{Failed to open for output: $!\n};

while ( my $line = <FILE> ) {
    chomp($line);

    if ( $line =~ m/^>/ ) {
        print match_fh "$line\n";

        my @data = split( /\|/, $line );

        my $nextline = <FILE>;

        if ( $nextline !~ /N+/g ) {

            if ( $nextline =~ /[ATGC]{3}/g ) {

            }
            print match_fh "$nextline";
        }
    }
}

close FILE;
close match_fh;



INPUT
>seq1
ATGCGGGATGATCCGAACGTTTAATCTCGTATGCCGTCTTCTATCTCNNN
>seq2
GATGAGCTTGACTCTAGTCCATCTCGTATGCCGTCTTCTGCTATCTCGTA
>seq3
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTC
>seq4
TGGTACTGTAAGCATGAGAGTAATCTCGTATGCCGTCTTCTGCTTGAAAA

OUTPUT
>seq2
GATGAGCTTGACTCTAGTCCATCTCGTATGCCGTCTTCTGCTATCTCGTA
>seq4
TGGTACTGTAAGCATGAGAGTAATCTCGTATGCCGTCTTCTGCTTGAAAA

Solution

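while (my $head = <FILE>) {
    # Skip anything that is not a header line
    next if ($head !~ /^>/);
    # Read the sequence line that follows the header
    $_ = <FILE>;
    # Keep the record only if it has no N and at least 3 of A, T, G, C occur.
    # Each match counts as 1 (or 0), so the sum is the number of distinct bases.
    if (!/N+/ && /A/ + /T/ + /G/ + /C/ >= 3) {
        print match_fh $head, $_;
    }
}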
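
The loop in the solution relies on the FILE and match_fh handles opened in the question. For reference, here is a minimal self-contained sketch of the same idea. It assumes each record is exactly two lines (a header, then one sequence line) and that bases are uppercase, as in the sample input. The helper names keep_sequence and filter_fasta are hypothetical and introduced only for illustration. Because the header is printed only together with a sequence that passes the check, the ID line of a rejected sequence is dropped automatically.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true when the sequence has no N and contains
# at least three of the four bases A, T, G, C (uppercase assumed).
sub keep_sequence {
    my ($seq) = @_;
    return 0 if $seq =~ /N/;
    # grep in scalar context returns how many of the bases occur in $seq
    my $distinct = grep { index( $seq, $_ ) >= 0 } qw(A T G C);
    return $distinct >= 3 ? 1 : 0;
}

# Hypothetical helper: copy two-line FASTA records from $in to $out,
# keeping the header only when its sequence passes the check.
sub filter_fasta {
    my ( $in, $out ) = @_;
    while ( my $head = <$in> ) {
        next if $head !~ /^>/;      # skip anything that is not a header
        my $seq = <$in>;
        last unless defined $seq;   # header with no sequence at end of file
        print {$out} $head, $seq if keep_sequence($seq);
    }
    return;
}

# Run as a script only when an input file name is given
if (@ARGV) {
    open my $in, '<', $ARGV[0]
        or die qq{Failed to open "$ARGV[0]" for input: $!\n};
    open my $out, '>', "$ARGV[0]_trimmed.fasta"
        or die qq{Failed to open "$ARGV[0]_trimmed.fasta" for output: $!\n};
    filter_fasta( $in, $out );
    close $in;
    close $out;
}
```

Saved under a name of your choice and run with the input file as its only argument, this writes the surviving records to a file named after the input with _trimmed.fasta appended, the same naming the question uses. Sequences wrapped over several lines would need to be joined before the check, which this sketch does not attempt.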