Search code examples
pythonregexpython-rebiopythonfuzzy-search

finding CDRs in NGS data


I have millions of sequences in fasta format and want to extract CDRs (CDR1, CDR2 and CDR3).I chose only one sequence as an example and tried to extract CDR1 but not able to extract CDR1.

sequence:-'FYSHSAVTLDESGGGLQTPGGGLSLVCKASGFTFSSYGMMWVRQAPGKGLEYVAGIRNDA GDKRYGSAVQGRATISRDNGQSTVRLQLNNLRAEDTGTYFCAKESGCYWDSTHCIDAWGH GTEVIVSTGG'.

cdr1 starts from:- 'VCKASGFTFS', with maximum three replacements but C at 2nd place is must. cdr1 ends at:-'WVRQAP', with maximum two replacements but R at 3rd place is must.

Extracted cdr1 should be SYGMM

def cdr1_in(cdr_in): #VCKASGFTFS
    pin=0
    max_pin=3       
    
    if cdr[1]!='C':
        pin+=1
    if cdr[0]!='V':
        pin+=1
    if cdr[2]!='K':
        pin+=1
    if cdr[3]!='A':
        pin+=1    
    if cdr[4]!='S':
        pin+=1
    if cdr[5]!='G':
        pin+=1
    if cdr[6]!='F':
        pin+=1
    if cdr[7]!='T':
        pin+=1    
    if cdr[8]!='F':
        pin+=1
    if cdr[9]!='S':
        pin+=1   
  
    if pin<max_pin:
        print('CDR_in pattern', cdr_in)
        # print('CDR_starts from', arr.index(cdr_in)+9)
        return (arr.index(cdr_in)+9)
 
    def cdr1_out(cdr_out):#WVRQAP
    
        pin=0
        max_pin=2            
        if cdr[1]!='V':
            pin+=1
        if cdr[0]!='W':
            pin+=1
        if cdr[2]!='R':
            pin+=1
        if cdr[3]!='Q':
            pin+=1    
        if cdr[4]!='A':
            pin+=1
        if cdr[5]!='P':
            pin+=1
            
        if pin<max_pin:
            # print('CDR_in pattern', cdr_out)
            # print('CDR_ends at', arr.index(cdr_out))
            return (arr.index(cdr_out))
 

K=10
arr=sequence
for i in range(len(arr)-k+1):
        slider=arr[i:k+i]
        print("CDR_1 is:", arr[cdr1_in(slider): cdr1_out(slider)])        
          

Solution

  • I got it by the following method which works absolutely fine for me to find CDR1,2 and 3. All I need to define 3 different dictionaries having the definition ie prefix, suffix, max pin, fix position and pass them to the following code.

    Here I have performed this to find the CDR1, which gives me the desired output.

    dictionary_1={
                'cdr1_in': 'VCKASGFTFS',
                'cdr1_out':'WVRQAP',
                'max_pin_in':3,
                'position_c':2,
                'max_pin_out':2,
                'position_r':3
        
                }
    
    
    def cdr_out(dictionary_1,x):
        count=0
        for i in range(len(x)-5):
            rider=x[i:i+6]
            # print(rider)
            if rider[2]=='R':
                # print(rider)
                for i in range(len(dictionary_1['cdr1_out'])):
                    if rider[i]!=dictionary_1['cdr1_out'][i]:
                        count+=1
                if count<dictionary_1['max_pin_out']:
                    # print(rider)
                    return x.index(rider)
    
    def cdr_in(dictionary_1,x):
        count=0            
        for i in range(len(x)-9):
            rider=x[i:i+10]
            # print(rider)
            if rider[1]=='C':
                # print(rider)
                for i in range(len(dictionary_1['cdr1_in'])):
                    if rider[i]!=dictionary_1['cdr1_in'][i]:
                        count+=1
                if count<dictionary_1['max_pin_in']:
                    # print(rider)
                    y=x.index(rider)
                    z=cdr_out(dictionary_1,x)
                    cdr=x[y+10:z]
                    return cdr
                
    print(cdr_in(dictionary_1,x))
    

    SYGMM