Search code examples
pythonbioinformaticsbiopythonfastasequence-alignment

Replacing all of instances of a letter in a column of a FASTA alignment file


I am writing a script which can replace all of the instances of an amino acid residue in a column of a FASTA alignment file. Using AlignIO, I just can read an alignment file and extract information from it but I can't modify their sequences. Moreover, MutableSeq module just can be modify string sequences and if I use a seq object input, it can't modify it. I'd like to find a module or a method to modify an alignment file and save it, while it is in the structure of AlignIO as a sequence object for subsequent procedures.

My code using just AlignIO:

alignment = AlignIO.read(input_handle, "fasta")
for record in alignment:
    if record.seq[10] == "G":
        record.seq[10] = "T"

Output:

record.seq[10] = "T"
TypeError: 'Seq' object does not support item assignment

My code using both AlignIO and MutableSeq:

alignment = AlignIO.read(input_handle, "fasta")
for record in alignment[0:1, : ]:
    mut_align = MutableSeq(record.seq)
    mut_align.__delitem__(10)
    mut_align.insert(10, "T")
    print(mut_align)

Output:

del self.data[index]
TypeError: 'Seq' object doesn't support item deletion

Solution

  • This is a good question, I think what you're doing with MutableSeq should work or fail clearly right away, but instead here is a workaround:

    from Bio import AlignIO
    
    alignment = AlignIO.read('test.fasta', 'fasta')
    for record in alignment:
        record.seq = record.seq.tomutable()
        if record.seq[2] == "G":
            record.seq[2] = "T"
        print(record)
    

    Outputs:

    ID: 1
    Name: 1
    Description: 1
    Number of features: 0
    MutableSeq('ATTAG')
    ID: 2
    Name: 2
    Description: 2
    Number of features: 0
    MutableSeq('AATAG')
    

    For input data:

    $ cat test.fasta 
    >1
    ATGAG
    >2
    AAGAG
    

    I consider the fact that a MutableSeq object is created from a Seq object in your example but that the method fails as a bug, which I filed here: https://github.com/biopython/biopython/issues/1918


    Here is another rather inefficient workaround, re-building the string each time, if you want to avoid using MutableSeq all-together:

    alignment = AlignIO.read('test.fasta', 'fasta')
    for record in alignment:
        if record.seq[2] == "G":
            record.seq = record.seq[:2] + 'T' + record.seq[3:]
        print(record)