Search code examples
pythonbioinformaticsbiopythonpdb

Extract multiple chains from the CIF file (based on specified selections) and save them as one PDB


I have several CIF files, for each of them I'd like to extract 3 chains and save the three chains in one PDB file. Problem could be that some CIF structures can contain different chain-IDs of my desired chains: I solved it the way that I parsed RCSB webpage with Beautiful Soup and save the chain-IDs for each CIF file in corresponding yaml file. Then I open each CIF file as well as its yaml file, then I get correct chainIDs for each CIF from its yaml file.

Now I'd like to extract all three chains and save them. I tried to use something from similar questions this. However, the last line of my code (with io.save) returns error: IndexError: tuple index out of range

code:

cif_input = "AAA","BBB","CCC"
for ID in cif_input:
    cif_file = "{}.cif" .format(ID) #these are my CIF files
    
    yaml_file_protein = "{}.yaml" .format(ID) #these are yaml files for each of the CIF files, they contain information about chain IDs
    with open(yaml_file_protein, "r") as file_protein:
        proteins = yaml.load(file_protein, Loader=yaml.FullLoader) #loading the yaml file
        chain_1 = proteins["chain1"] #here I get information what ID has chain no.1, result in AAA.cif is for example: U
        chain_2 = proteins["chain2"] #here I get information what ID has chain no.2
        chain_3 = proteins["chain3"] #here I get information what ID has chain no.3
                
        structure = parser.get_structure("{}" .format(ID), "{}.cif" .format(ID))[0] #parsing corresponding structure (CIF file)
        
        #this is what I tried implement from similar questions: I need to extract all three chains and save them in one PDB, this have to be done for all CIF files
        class ChainSelect(Select):
            def accept_chain(self, chain):
                if chain.get_id()=='{}'.format(chain_1): #this is what I should select chainID of chain_1 based on what is specified above from the yaml file
                    return True
                if chain.get_id()=='{}'.format(chain_2):
                    return True
                if chain.get_id()=='{}'.format(chain_3):
                    return True
                else:
                    return False
        
        io = PDBIO()
        io.set_structure(structure)
        io.save("{}_selection.pdb" .format(ID), ChainSelect(chain))

I assume the problem will be somewhere in the "class ChainSelect..." but can't figure out what exactly could be causing such error. I'd be very helpful for any suggestions.

Edit: I also thought the problem could be in lines if chain.get_id()=='{}'.format(chain_X):, because it'd return e.g. U and not "U", thus I tried to edit it as if chain.get_id()==" '{}' ".format(chain_X): which return "U" , however the error is still the here. I tried this code with only one chain as well, but still the same error...


Solution

  • this code of mine works, reads '2ms2.pdb' , pdb that has Chains A,B,C

    and saves "prova.pdb" with only Chains A,C. Without headers

    
    from Bio.PDB import PDBParser, PDBIO, Select
    
    
    
    parser=PDBParser(QUIET=True)
    
    
    structure_1 = parser.get_structure('test', '2ms2.pdb')
    
    
    
    for model in structure_1:
        
        print(model)
        
        for chain in model:
            
            print(chain, chain.get_id(), type(chain.get_id()))
            
            class ChainSelect(Select):
                        def accept_chain(self, chain):
                            if chain.get_id() == 'A':
                                return True
                        
                            if chain.get_id() == 'C':
                                return True
                            else:
                                return False
            
    io = PDBIO()
    io.set_structure(structure_1)
    io.save("prova.pdb" , ChainSelect())
    
    

    not sure where best place to put

    class ChainSelect(Select):
                def accept_chain(self, chain):
                    if chain.get_id() == 'A':
                        return True
                
                    if chain.get_id() == 'C':
                        return True
                    else:
                        return False
    

    in this code though

    Instruction here: https://biopython.org/wiki/The_Biopython_Structural_Bioinformatics_FAQ

    Can I write PDB files?

    Use the PDBIO class for this. It’s easy to write out specific parts of a structure too, of course.

    Example: saving a structure

    io = PDBIO()
    io.set_structure(s)
    io.save("out.pdb")
    

    If you want to write out a part of the structure, make use of the Select class (also in PDBIO). Select has four methods:

    accept_model(model)
    accept_chain(chain)
    accept_residue(residue)
    accept_atom(atom)
    

    By default, every method returns 1 (which means the model/chain/residue/atom is included in the output). By subclassing Select and returning 0 when appropriate you can exclude models, chains, etc. from the output. Cumbersome maybe, but very powerful. The following code only writes out glycine residues:

    class GlySelect(Select):
        def accept_residue(self, residue):
            if residue.get_name() == "GLY":
                return 1
            else:
                return 0
    
    
    io = PDBIO()
    io.set_structure(s)
    io.save("gly_only.pdb", GlySelect())
    

    If this is all too complicated for you, the Dice module contains a handy extract function that writes out all residues in a chain between a start and end residue.