
BioPython: Amino acid Sequence contains 'J' and can't calculate the molecular weight


The data I am working with comes in an Excel file that has the amino acid sequence in column index 1. I'm trying to calculate different attributes of each sequence using BioPython. This is the code I have so far:

import xlrd
import sys
from Bio.SeqUtils.ProtParam import ProteinAnalysis

print '~~~~~~~~~~~~~~~ EXCEL PARSER FOR PVA/NON-PVA DATA ~~~~~~~~~~~~~~~'

print 'Path to Excel file:', str(sys.argv[1])
fname = sys.argv[1]
workbook = xlrd.open_workbook(fname)

print ''
print 'The sheet names that have been found in the Excel file: '
sheet_names = workbook.sheet_names()
number_of_sheet = 1
for sheet_name in sheet_names:
    print '*', number_of_sheet, ':     ', sheet_name
    number_of_sheet += 1

with open("thefile.txt","w") as f:
    lines = []
    f.write('LENGTH.SEQUENCE,SEQUENCE,MOLECULAR.WEIGHT\n')
    for sheet_name in sheet_names:
        worksheet = workbook.sheet_by_name(sheet_name)
        print 'opened: ', sheet_name
        for i in range(1, worksheet.nrows):
            row = worksheet.row_values(i)
            analysed_seq = ProteinAnalysis(row[1].encode('utf-8'))
            weight = analysed_seq.molecular_weight()
            lines.append('{},{},{}\n'.format(row[2], row[1].encode('utf-8'), weight))
    f.writelines(lines)

It was working until I added the molecular weight calculation, which raises the following error:

Traceback (most recent call last):
  File "Excel_PVAdata_Parser.py", line 28, in <module>
    weight = analysed_seq.molecular_weight()
  File "/usr/lib/python2.7/dist-packages/Bio/SeqUtils/ProtParam.py", line 114, in molecular_weight
    total_weight += aa_weights[aa]
KeyError: 'J'

I looked in the Excel data file, and the amino acid sequence does indeed contain a J. Does anyone know of a BioPython module that handles these 'unknown' amino acids, or have another suggestion?


Solution

  • As peterjc said, J is the ambiguity code for a residue that is either leucine (L) or isoleucine (I). Both have the same molecular weight:

    >>> from Bio.SeqUtils.ProtParam import ProteinAnalysis
    >>> ProteinAnalysis('L').molecular_weight()
    131.1729
    >>> ProteinAnalysis('I').molecular_weight()
    131.1729
    

    Because the two residues weigh the same, you could temporarily replace every J with either L or I when calculating the molecular weight; the result is the same either way.
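
A minimal sketch of that substitution, assuming the sequences are plain strings of one-letter codes. The helper name `replace_j` and the example sequence are hypothetical and only serve as illustration; they are not part of BioPython:

```python
def replace_j(sequence, substitute="L"):
    """Return a copy of `sequence` with every ambiguous 'J' replaced.

    `substitute` must be 'L' or 'I'; both have the same molecular
    weight, so the choice does not affect the calculated weight.
    """
    if substitute not in ("L", "I"):
        raise ValueError("substitute must be 'L' or 'I'")
    # Upper-case first so that a lower-case 'j' is caught as well.
    return sequence.upper().replace("J", substitute)


# Hypothetical example sequence containing two ambiguous residues.
cleaned = replace_j("MKJLVJA")
print(cleaned)  # MKLLVLA
```

In the loop from the question, the cleaned string would be what gets passed to `ProteinAnalysis(...)` before calling `molecular_weight()`, while the original `row[1]` can still be written to the output file so the J is not lost from the data.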