How to generate subpeptides (special combinations) from a string representing a cyclic peptide?


Here is my problem: I have a sequence representing a cyclic peptide, and I'm trying to write a function that generates all possible subpeptides. A subpeptide is created when bonds between two amino acids are broken. For example, for the peptide 'ABCD', the subpeptides would be 'A', 'B', 'C', 'D', 'AB', 'BC', 'CD', 'DA', 'ABC', 'BCD', 'CDA', 'DAB'. Thus, the number of possible subpeptides from a peptide of length n will always be n*(n-1). Note that not all of them are substrings of the peptide ('DA', 'CDA'...).

I've written code that generates combinations. However, it produces some extra elements, such as amino acids that are not linked ('AC', 'BD'...). Does anyone have a hint on how I could eliminate those, given that the peptide may have a different length each time the function is called? Here's what I have so far:

from itertools import combinations

def Subpeptides(peptide):
    subpeptides = []
    for n in range(1, len(peptide)):
        subpeptides.extend(
            [''.join(comb) for comb in combinations(peptide, n)]
        )
    return subpeptides

Here are the results for peptide 'ABCD':

['A', 'B', 'C', 'D', 'AB', 'AC', 'AD', 'BC', 'BD', 'CD', 'ABC', 'ABD', 'ACD', 'BCD']

The order of the amino acids is not important, as long as they correspond to a real contiguous stretch of the peptide. For example, 'ABD' is a valid form of 'DAB', since D and A are bonded in the cyclic peptide.

I'm using Python.


Solution

  • It's probably easier to just generate the valid ones directly:

    def subpeptides(peptide):
        l = len(peptide)
        looped = peptide + peptide
        for start in range(0, l):
            for length in range(1, l):
                print(looped[start:start+length])
    

    which gives:

    >>> subpeptides("ABCD")
    A
    AB
    ABC
    B
    BC
    BCD
    C
    CD
    CDA
    D
    DA
    DAB
    

    (If you want a list instead of printed output, change print(...) to yield ... so that the function becomes a generator, then wrap the call in list().)

    All the above does is enumerate the different places where the first bond could be broken, and then the different products you would get if the next bond broke after one, two, or three (in this case) acids. looped is just an easy way to avoid writing the logic for going "round the loop".
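
As a minimal sketch of the generator variant mentioned above, the same loop can yield each fragment instead of printing it. The name cyclic_subpeptides is only an illustrative choice, not something from the question. Collecting the results in a list also makes it easy to check the n*(n-1) count.

```python
def cyclic_subpeptides(peptide):
    """Yield every contiguous fragment of a cyclic peptide, excluding the full ring."""
    n = len(peptide)
    # Doubling the string lets a plain slice wrap past the end of the ring.
    looped = peptide + peptide
    for start in range(n):
        # Lengths 1 .. n-1: the full-length fragment is deliberately left out.
        for length in range(1, n):
            yield looped[start:start + length]


fragments = list(cyclic_subpeptides("ABCD"))
print(fragments)
print(len(fragments))  # 12, i.e. 4 * (4 - 1)
```

Note that if the peptide contains repeated amino acids (for example 'AABA'), the same fragment string can appear more than once. Whether duplicates should be kept depends on what the fragments are used for, so this sketch leaves them in.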