
Generate all possible unique peptides (permutants) in Python/Biopython


I have a peptide frame of 9 amino acids (AA). I want to generate all possible peptides by replacing at most 3 AA in this frame, i.e. by replacing exactly 1, 2, or 3 AA.

The frame is CKASGFTFS, and I want to see all the mutants obtained by replacing at most 3 AA with residues from the pool of 20 AA.

We have a pool of 20 different AA (A, R, N, D, E, G, C, Q, H, I, L, K, M, F, P, S, T, W, Y, V).

I am new to coding, so can someone help me with how to code this in Python or Biopython?

The output is supposed to be a list of unique sequences like the following:

CKASGFTFT, CTTSGFTFS, CTASGKTFS, CTASAFTWS, CTRSGFTFS, CKASEFTFS, ... and so on, each with 1, 2, or 3 substitutions from the pool of AA and the rest of the frame unchanged.


Solution

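  • OK, after my code finished, I checked the count by working the calculation backwards. Each substituted position has 19 possible replacements (the 20 AA minus the residue already there):

    Case 1: 9C1 x 19 = 171

    Case 2: 9C2 x 19 x 19 = 12,996

    Case 3: 9C3 x 19 x 19 x 19 = 576,156

    That's a total of 589,323 combinations.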
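
    The arithmetic above can be double-checked with a few lines of Python. This is only a sketch: the names FRAME_LENGTH, ALTERNATIVES and count_mutants are made up for illustration, and math.comb needs Python 3.8 or newer.

```python
from math import comb

FRAME_LENGTH = 9    # residues in the frame CKASGFTFS
ALTERNATIVES = 19   # 20 amino acids minus the one already at that position

def count_mutants(k):
    """Number of sequences with exactly k substituted positions."""
    # Choose which k positions change, then pick one of 19 residues for each.
    return comb(FRAME_LENGTH, k) * ALTERNATIVES ** k

counts = {k: count_mutants(k) for k in (1, 2, 3)}
total = sum(counts.values())
print(counts)  # {1: 171, 2: 12996, 3: 576156}
print(total)   # 589323
```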

    Here is the code for all 3 cases; you can run them sequentially.

    You also asked for each array to be joined into a single string, so I have updated my code to do that.

    import copy
    original = ['C','K','A','S','G','F','T','F','S']
    possibilities = ['A','R','N','D','E','G','C','Q','H','I','L','K','M','F','P','S','T','W','Y','V']
    storage=[]
    counter=1
    
    # case 1: substitute exactly one position
    for i in range(len(original)):
        for x in range(len(possibilities)):
            temp = copy.deepcopy(original)
            if temp[i] == possibilities[x]:
                pass
            else:
                temp[i] = possibilities[x]
                storage.append(''.join(temp))
                print(counter,''.join(temp))
                counter += 1
    
    # case 2: substitute every pair of positions i < j
    for i in range(len(original)):
        for j in range(i+1,len(original)):
            for x in range(len(possibilities)):
                for y in range(len(possibilities)):
                    temp = copy.deepcopy(original)
                    if temp[i] == possibilities[x] or temp[j] == possibilities[y]:
                        pass
                    else:
                        temp[i] = possibilities[x]
                        temp[j] = possibilities[y]
                        storage.append(''.join(temp))
                        print(counter,''.join(temp))
                        counter += 1
    
    # case 3: substitute every triple of positions i < j < k
    for i in range(len(original)):
        for j in range(i+1,len(original)):
            for k in range(j+1,len(original)):
                for x in range(len(possibilities)):
                    for y in range(len(possibilities)):
                        for z in range(len(possibilities)):
                            temp = copy.deepcopy(original)
                            if temp[i] == possibilities[x] or temp[j] == possibilities[y] or temp[k] == possibilities[z]:
                                pass
                            else:
                                temp[i] = possibilities[x]
                                temp[j] = possibilities[y]
                                temp[k] = possibilities[z]
                                storage.append(''.join(temp))
                                print(counter,''.join(temp))
                                counter += 1
    

    The output looks like this (just the beginning and the end).

    The results are also saved to a variable named storage, which is a native Python list.

    1 AKASGFTFS
    2 RKASGFTFS
    3 NKASGFTFS
    4 DKASGFTFS
    5 EKASGFTFS
    6 GKASGFTFS
    ...
    ...
    ...
    589318 CKASGFVVF
    589319 CKASGFVVP
    589320 CKASGFVVT
    589321 CKASGFVVW
    589322 CKASGFVVY
    589323 CKASGFVVV
    
    

    It takes around 10 to 20 minutes to run, depending on your computer.

    It displays all the combinations and skips any candidate in which a replacement AA is the same as the original at that position (the one position in case 1, either of the two in case 2, any of the three in case 3).
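
    The same enumeration can be written more compactly with itertools. This is a sketch of an alternative, not the code above: the names mutants, max_subs and AMINO_ACIDS are invented for the example. It uses itertools.combinations to pick the positions and itertools.product to pick the replacements, and instead of skipping matches it removes the original residue from each position's choices up front.

```python
from itertools import combinations, product

# Same 20-letter pool and order as the possibilities list above.
AMINO_ACIDS = "ARNDEGCQHILKMFPSTWYV"

def mutants(frame, max_subs=3, alphabet=AMINO_ACIDS):
    """Yield every sequence that differs from frame at 1..max_subs positions."""
    for k in range(1, max_subs + 1):
        for positions in combinations(range(len(frame)), k):
            # For each chosen position, allow only residues that differ
            # from the original one, so no candidate has to be skipped.
            choices = [[aa for aa in alphabet if aa != frame[p]]
                       for p in positions]
            for replacement in product(*choices):
                seq = list(frame)
                for p, aa in zip(positions, replacement):
                    seq[p] = aa
                yield "".join(seq)

variants = list(mutants("CKASGFTFS"))
print(len(variants), len(set(variants)))  # 589323 589323
print(variants[0], variants[-1])          # AKASGFTFS CKASGFVVV
```

    Because mutants is a generator, you can also loop over it directly without building the full list first, which avoids holding every sequence in memory at once.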

    This code both prints the sequences and stores them in a list, so it can be memory-intensive as well as CPU-intensive.

    You could reduce the memory footprint of the stored strings by encoding the letters as numbers, since they might take less space. You could also consider using something like pandas, or appending to a CSV file on disk.

    You can iterate over the storage variable to go through the strings if you wish, like this:

    for i in storage:
        print(i)
    

    Or you can convert it to a pandas Series or DataFrame, or write it line by line directly to a CSV file on disk.
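
    As a rough illustration of the CSV route, the sketch below writes one sequence per row with the standard csv module. The helper name write_sequences, the "sequence" column header and the file name mutants.csv are all assumptions made for the example. A short hand-written list stands in for storage here so the example runs on its own.

```python
import csv
import os
import tempfile

def write_sequences(path, sequences):
    """Write one sequence per row to a CSV file; return the row count."""
    written = 0
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sequence"])  # header row; the column name is arbitrary
        for seq in sequences:          # works for a list or any other iterable
            writer.writerow([seq])
            written += 1
    return written

# Stand-in for the storage list: the first three sequences of the output.
demo = ["AKASGFTFS", "RKASGFTFS", "NKASGFTFS"]
out_path = os.path.join(tempfile.mkdtemp(), "mutants.csv")
rows_written = write_sequences(out_path, demo)
print(rows_written, "rows written to", out_path)
```

    Passing storage instead of demo would write all the generated sequences in the same way.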