Search code examples
perlfastabioperl

Using Bioperl to alter nucleotides at specific positions in fasta file?


I am trying to adapt a Bioperl script to change nucleotides at specific positions in a fasta file and output a new file with altered sequences.

Example of fasta input:

>seq1
AAATAAA

Example of nucleotide postions to change file:

##fileformat=VCFv4.1                
##samtoolsVersion=0.1.18 (r982:295)             
#CHROM  POS REF ALT
seq_1   4   G   A

The output of my script should be:

seq_1  AAAGAAA

This is my current script:

 #!/usr/bin/env perl

use strict;
use warnings;
use Bio::SeqIO;
use Bio::Tools::CodonTable;
use Bio::Seq;


my $original = shift @ARGV;
my $vcf = shift @ARGV;
my $outname = shift @ARGV;

# read in fasta file with gene sequences
my $in  = Bio::SeqIO->new(-file => "$original" , '-format' => 'Fasta');
my $out = Bio::SeqIO->new('-format' => 'Fasta');

    open (my $fh2, $vcf) or die "Error, cannot open file $vcf";
            my @vcf= <$fh2>;
    close ($fh2);

my $pos2;

while ( my $seq = $in->next_seq() ) {
    my $id = $seq->id;
    my $sequence = $seq->seq(); # get the sequence from the fasta file

    # Search sequence in the vcf file and get the position of the SNP   
    foreach my $vcfline(@vcf){
        if($vcfline =~ /$id/){
        if($vcfline !~ /^#/){
            $vcfline=~ s/\R//g;
            my @vcfline= split(' ', $vcfline);
            my $comp= $vcfline[0];
            my $pos= $vcfline[1];
            my $REF= $vcfline[2];

            my $pos2=$pos-1; # correct position
# mutate the sequence
            my $seq3=substr($sequence,$pos2,1,$REF);
open(OUT, ">> $outname");
print OUT
"$id\t$seq3\n";
close OUT;
}}}}

This currently only prints out a file with sequence ID and the new nucleotide (taken from the 4th column in the nucleotide change file), but I want the new sequence incorporating the nucleotide change.

I'm sorry I know very little Perl and only just started using Bioperl, so would really appreciate some guidance on how to change this script. Even better if the output could be in fasta format? I have only managed to get this far as I am adapting someone else's script! Thanks.


Solution

  • You're getting this result because substr returns only the replaced value, not the entire string in which it made the replacement. Quite simply, you don't need to store the return value of substr in $seq3 since (as you've found out) it just duplicates what is in $REF: just print $sequence instead.

    print OUT "$id\t$sequence\n";