Search code examples
bioinformaticsbiopython

Replacing PDB atom names with Biopython


I have a pdb containing a protein and one single ligand. I do not like how the ligand's hydrogens are named (1H2, 2H2, 1H3, 2H3, ...) and I would like something like H1, H2, H3, H4, ...

I wrote the following script, the problem is that it seems it's not possible to assign new atom ids. The change is reflected by Atom.id, but this change is not present in the output pdb structure, which retains the old names.

from Bio.PDB import PDBParser, PDBIO

io = PDBIO()
target_pdb_path = 'mypdb.pdb'
pdb = PDBParser(QUIET=True).get_structure('target', target_pdb_path)[0]

hydrogens = []
for atom in pdb.get_atoms():
    if atom.parent.id[0].startswith('H_'):
        # The atom is an hydrogen and is an HETATM record
        if 'H' in atom.name:
            hydrogens.append(atom)

# Rename hydrogens of the ligand
for h_num, h in enumerate(hydrogens, 1):
    # this is working, but the change is not present in the output pdb structure
    h.id = f'H{h_num} '

io.set_structure(pdb)
io.save('test.pdb')

See how assigning h.id did not change the full_id. I tried also replacing the whole full_id tuple, but also this does not work. Unfortunately it seems there's no method to change the id just like other features as Atom.set_bfactor(), Atom.set_coord() etc.

In [233]: h.full_id  # still the old id 3H18
Out[233]: ('target', 0, 'X', ('H_EOD', 401, ' '), ('3H18', ' '))

In [234]: h.id  # my new id
Out[234]: 'H29 '

Does anyone have a solution? Many thanks!


Solution

  • Given @nannarito concerns, I tried to find a way that doesn't need creating new Atom objects, I wasn't able to get a grasp of what goes behind the wheels of Biopython PDB module (I tried but it eludes me). After various attempts I ended up with the following code:

    from Bio.PDB import PDBParser, PDBIO
    
    
    
    target_pdb_path = 'small_pdb_h_gtp_no-connect_numb.pdb'
    
    
    pdb = PDBParser(QUIET=True).get_structure('target', target_pdb_path)
    
    hydrogens = []
    for atom in pdb[0].get_atoms():
        
        if atom.parent.get_resname() == 'GTP' :
        
            if atom.parent.id[0].startswith('H_'):
                
                print(atom.parent.id ,  atom.name)
                
                # The atom is an hydrogen and is an HETATM record
                if 'H' in atom.name:
                    
                    print('ok')
                    
                    hydrogens.append(atom)
                
    print('\n\nhydrogens : \n ', hydrogens,'\n\n')            
    
    for h_num, h in enumerate(hydrogens, 1):
        
        
        setattr( h , 'fullname' , f'W{h_num}'  )  ## or  h.fullname = f'W{h_num}'  
        
        print(h.serial_number, h.name , h.id , h.full_id , h.level , h.parent)
    
    
    
    atoms = pdb.get_atoms()
    
    for h in atoms :
        
        print(h.serial_number, h.name , h.id , h.full_id , h.level , h.parent)
    
    io = PDBIO()
    io.set_structure(pdb)
    io.save('test_new_approach.pdb', preserve_atom_numbering = False)
    
    

    Using same input of answer above I get as output file test_new_approach.pdb:

    ......
    .....
    ....
    ...
    HETATM  147  C5  GTP A 180      20.554  34.737 -11.307  1.00  0.00           C  
    HETATM  148  C6  GTP A 180      19.183  34.712 -11.659  1.00  0.00           C  
    HETATM  149  O6  GTP A 180      18.205  34.448 -10.957  1.00  0.00           O  
    HETATM  150  N7  GTP A 180      21.168  34.483 -10.079  1.00  0.00           N  
    HETATM  151  C8  GTP A 180      22.443  34.655 -10.325  1.00  0.00           C  
    HETATM  152  N9  GTP A 180      22.724  35.005 -11.630  1.00  0.00           N  
    HETATM  153  W1  GTP A 180      27.642  33.664 -10.448  1.00  0.00           H  
    HETATM  154  W2  GTP A 180      26.472  32.436 -10.894  1.00  0.00           H  
    HETATM  155  W3  GTP A 180      26.872  34.003 -12.692  1.00  0.00           H  
    HETATM  156  W4  GTP A 180      27.038  36.109 -10.945  1.00  0.00           H  
    HETATM  157  W5  GTP A 180      26.303  36.091 -13.672  1.00  0.00           H  
    HETATM  158  W6  GTP A 180      24.683  36.247 -10.440  1.00  0.00           H  
    HETATM  159  W7  GTP A 180      24.926  37.660 -12.845  1.00  0.00           H  
    HETATM  160  W8  GTP A 180      23.874  35.594 -13.231  1.00  0.00           H  
    HETATM  161  W9  GTP A 180      18.670  35.593 -15.377  1.00  0.00           H  
    HETATM  162  W10 GTP A 180      20.293  35.851 -15.834  1.00  0.00           H  
    HETATM  163  W11 GTP A 180      27.124  35.555  -7.891  1.00  0.00           H  
    HETATM  164  W12 GTP A 180      26.059  32.241  -5.339  1.00  0.00           H  
    HETATM  165  W13 GTP A 180      22.030  35.588 -14.193  1.00  0.00           H  
    HETATM  166  W14 GTP A 180      30.718  31.035  -4.497  1.00  0.00           H  
    HETATM  167  W15 GTP A 180      23.174  34.539  -9.606  1.00  0.00           H  
    TER     168      GTP A 180                                                       
    END   
    

    So setattr( h , 'fullname' , f'W{h_num}' ) seems to do the trick, but for some reasons inexplicable to me, this:

    atoms = pdb.get_atoms()
    
    for h in atoms :
        
        print(h.serial_number, h.name , h.id , h.full_id , h.level , h.parent)
    

    bit, used after the setattr( h , 'fullname' , f'W{h_num}' ) before or after the:

    io = PDBIO()
    io.set_structure(pdb)
    

    still produces:

    ......
    .....
    ....
    ...
    295 H10 H10 ('target', 0, 'A', ('H_GTP', 180, ' '), ('H10', ' ')) A <Residue GTP het=H_GTP resseq=180 icode= >
    296 H11 H11 ('target', 0, 'A', ('H_GTP', 180, ' '), ('H11', ' ')) A <Residue GTP het=H_GTP resseq=180 icode= >
    297 H12 H12 ('target', 0, 'A', ('H_GTP', 180, ' '), ('H12', ' ')) A <Residue GTP het=H_GTP resseq=180 icode= >
    298 H13 H13 ('target', 0, 'A', ('H_GTP', 180, ' '), ('H13', ' ')) A <Residue GTP het=H_GTP resseq=180 icode= >
    299 H14 H14 ('target', 0, 'A', ('H_GTP', 180, ' '), ('H14', ' ')) A <Residue GTP het=H_GTP resseq=180 icode= >
    300 H15 H15 ('target', 0, 'A', ('H_GTP', 180, ' '), ('H15', ' ')) A <Residue GTP het=H_GTP resseq=180 icode= >
    
    

    ... the original question gets more intriguing.

    I think I got something more:

    needed to add h.name = f'W{h_num}' too, like:

    h.fullname = f'W{h_num}'  # or setattr( h , 'fullname' , f'W{h_num}'  ) 
        
    h.name = f'W{h_num}'  # or setattr( h , 'name' , f'W{h_num}')
    

    to have:

    ......
    .....
    ....
    ...
    299 W14 H14 ('target', 0, 'A', ('H_GTP', 180, ' '), ('H14', ' ')) A <Residue GTP het=H_GTP resseq=180 icode= >
    ('target', 0, 'A', ('H_GTP', 180, ' '), ('W14', ' '))
    300 W15 H15 ('target', 0, 'A', ('H_GTP', 180, ' '), ('H15', ' ')) A <Residue GTP het=H_GTP resseq=180 icode= >
    ('target', 0, 'A', ('H_GTP', 180, ' '), ('W15', ' '))
    

    when using:

    print(h.serial_number, h.name , h.id , h.full_id , h.level , h.parent)
        
    print(h.get_full_id())
    
    

    so h.get_full_id() returns the updated atom name

    but still get old h.id when using:

    atoms = pdb.get_atoms()
    
    for h in atoms :
        
        print(h.serial_number, h.name , h.id , h.full_id , h.level , h.parent)
    

    .. so to me it's like something is going on on the Structure object and its parent/child relationships, since it can be corrected by using:

    par = h.parent
        
    par.detach_child(h.id)
    
    h.fullname = f'W{h_num}'  # or setattr( h , 'fullname' , f'W{h_num}'  ) 
    
    h.name = f'W{h_num}'  # or setattr( h , 'name' , f'W{h_num}')
    
    par.add(h)
    

    but still atom.id doesn't get reinitialized, see:

    atoms = pdb.get_atoms()
    
    for h in atoms :
        
        print(h.serial_number, h.name , h.id , h.full_id , h.level , h.parent)
    

    Output:

    299 W14 H14 ('target', 0, 'A', ('H_GTP', 180, ' '), ('W14', ' ')) A <Residue GTP het=H_GTP resseq=180 icode= >
    300 W15 H15 ('target', 0, 'A', ('H_GTP', 180, ' '), ('W15', ' ')) A <Residue GTP het=H_GTP resseq=180 icode= >
    

    In the end, try using:

    par = h.parent
        
    par.detach_child(h.id)
        
    h.fullname = f'W{h_num}'  
    
    h.name = f'W{h_num}' 
        
    h.id = f'W{h_num}' 
        
    par.add(h)
    

    ... but cannot guarantee that something somewhere isn't wrong.