Tags: python, python-3.x, memory, biopython

Biopython pairwise alignment results in segmentation fault when run in loop


I am trying to run Biopython's pairwise global alignment method in a loop over about 10,000 pairs of strings. Each string is about 20 characters long on average. Running the method on a single pair of sequences works fine, but running it in a loop, even for as few as 4 pairs, results in a segmentation fault. How can this be solved?

from Bio import pairwise2


def myTrial(source, targ):
    if source == targ:
        return [source, targ, source]

    alignments = pairwise2.align.globalmx(source, targ, 1, -0.5)
    return alignments


sour = ['najprzytulniejszy', 'sadystyczny', 'wyrzucić', 'świat']
targ = ['najprzytulniejszym', 'sadystycznemu', 'wyrzucisz', 'świat']
for i in range(4):
    a = myTrial(sour[i], targ[i])

Solution

  • The segmentation fault isn't caused by the loop. It happens because you are passing strings that contain non-ASCII characters to an alignment mode that accepts ASCII strings only. Luckily, Bio.pairwise2.align.globalmx can also align lists whose tokens are arbitrary strings, ASCII or not. For example, aligning ['ABC', 'ABD'] with ['ABC', 'GGG'] produces alignments like

    ['ABC', 'ABD', '-'  ]
    ['ABC', '-'  , 'GGG']
    

    In your case, aligning lists of single characters (some of them non-ASCII) such as ['ś', 'w', 'i', 'a', 't'] and ['w', 'y', 'r', 'z', 'u', 'c', 'i', 's', 'z'] produces alignments like

    ['ś', 'w', '-', '-', '-', '-', '-', 'i', 'a', 't', '-', '-']
    ['-', 'w', 'y', 'r', 'z', 'u', 'c', 'i', '-', '-', 's', 'z']
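
To see which inputs from the question are affected, a quick diagnostic can help. The sketch below does not call Biopython; it only uses `str.isascii()` (available from Python 3.7) and `list()`. The names `non_ascii` and `tokens` are illustrative and not part of the original code.

```python
# Inputs copied from the question.
sour = ['najprzytulniejszy', 'sadystyczny', 'wyrzucić', 'świat']
targ = ['najprzytulniejszym', 'sadystycznemu', 'wyrzucisz', 'świat']

# str.isascii() is False as soon as one character lies outside ASCII.
non_ascii = [s for s in sour + targ if not s.isascii()]
print(non_ascii)

# list() splits a str into one token per Unicode character, not per byte,
# which is the token form used for the alignment above.
tokens = list('świat')
print(tokens)
```

Only 'wyrzucić' and the two copies of 'świat' are flagged. Because the fourth pair is identical and returns before any alignment is attempted, it is presumably the third pair ('wyrzucić' against 'wyrzucisz') that reaches globalmx with a non-ASCII character.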
    

    To accomplish this with Biopython, in your code, replace

    alignments = pairwise2.align.globalmx(source, targ,1,-0.5)
    

    with

    alignments = pairwise2.align.globalmx(list(source), list(targ), 1, -0.5, gap_char=['-'])
    

    So for an input of

    source = 'świat'
    targ = 'wyrzucisz'
    

    the modified code will produce

    [(['ś', 'w', '-', '-', '-', '-', '-', 'i', 'a', 't', '-', '-'],
      ['-', 'w', 'y', 'r', 'z', 'u', 'c', 'i', '-', '-', 's', 'z'],
      2.0,
      0,
      12)]
    

    instead of a segmentation fault.
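
The score of 2.0 in this output can be checked without Biopython. globalmx applies no gap penalties, so with a match score of 1 and a mismatch score of -0.5 the best score is simply the number of tokens that can be matched in order ('w' and 'i' here). The sketch below is a plain dynamic-programming recurrence written for illustration only; `global_score` is a hypothetical helper, not Biopython's implementation.

```python
def global_score(a, b, match=1.0, mismatch=-0.5):
    """Best global alignment score for two token sequences when gaps are free."""
    # prev[j] is the best score for the tokens of `a` seen so far against b[:j].
    # Gaps cost nothing, so aligning anything against an empty prefix scores 0.
    prev = [0.0] * (len(b) + 1)
    for x in a:
        curr = [0.0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            # Either pair x with y, or leave one of them facing a gap.
            curr.append(max(diag, prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


score = global_score(list('świat'), list('wyrzucisz'))
print(score)  # 2.0
```

Because the function only compares tokens with `==`, it behaves the same for ASCII and non-ASCII characters, which is also why passing lists of tokens sidesteps the problem.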

    And since each token in the list is only one character long, you can also convert the resulting aligned lists back into strings using:

    new_alignment = []
    
    for aln in alignments:
        # Convert lists back into strings
        a = ''.join(aln[0])
        b = ''.join(aln[1])
    
        new_aln = (a, b) + aln[2:]
        new_alignment.append(new_aln)
    

    In the above example, new_alignment would then be

    [('św-----iat--', '-wyrzuci--sz', 2.0, 0, 12)]
    

    as desired.
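
If the conversion is needed in several places, it can be wrapped in a small function. This is a minimal sketch: `tokens_to_strings` is a hypothetical helper name, and it assumes every alignment is a 5-item sequence of the form shown above (two aligned token lists followed by score, begin and end).

```python
def tokens_to_strings(alignments):
    """Join the two aligned token lists of each alignment back into strings."""
    # The remaining fields (score, begin, end) are passed through unchanged.
    return [(''.join(a), ''.join(b)) + tuple(rest) for a, b, *rest in alignments]


# Example input copied from the output shown earlier in this answer.
example = [(['ś', 'w', '-', '-', '-', '-', '-', 'i', 'a', 't', '-', '-'],
            ['-', 'w', 'y', 'r', 'z', 'u', 'c', 'i', '-', '-', 's', 'z'],
            2.0, 0, 12)]

result = tokens_to_strings(example)
print(result)
```

As with the loop above, the joined strings only line up column by column when every token, including the gap token, is exactly one character long.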