Extract multiple protein chains from single PDB file


I have a PDB file that contains multiple chains but no chain IDs. I would like to use R to assign chain IDs so that I can analyze the individual protein chains and find specific sites within each.

I am currently using Rpdb to extract the data. Example data (the top few lines of each chain from a single PDB file) are below.

REMARK  99  Chain ID : 1
REMARK  99  Residues : 593
REMARK  99  Atoms    : 4782
REMARK  99  File     : final.sc.pdb
ATOM      1  N   MET     1      17.471 -55.657  42.605  1.00  0.00              
ATOM      2  CA  MET     1      17.516 -55.479  41.136  1.00  0.00              
ATOM      3  CB  MET     1      16.328 -56.188  40.460  1.00  0.00              
ATOM      4  C   MET     1      17.525 -54.045  40.745  1.00  0.00              
ATOM      5  O   MET     1      17.991 -53.186  41.492  1.00  0.00              
ATOM      6  CG  MET     1      14.961 -55.764  41.001  1.00  0.00           C  
ATOM      7  SD  MET     1      14.550 -56.460  42.632  1.00  0.00           S  
ATOM      8  CE  MET     1      12.951 -55.613  42.782  1.00  0.00           C  
ATOM      9  N   THR     2      17.012 -53.760  39.535  1.00  0.00              
ATOM     10  CA  THR     2      16.993 -52.420  39.040  1.00  0.00              
ATOM     11  CB  THR     2      16.552 -52.347  37.612  1.00  0.00                         
TER
REMARK  99  Chain ID : 1
REMARK  99  Residues : 531
REMARK  99  Atoms    : 4211
REMARK  99  File     : final.sc.pdb
ATOM      1  N   MET     1      55.179  17.162   2.445  1.00  0.00              
ATOM      2  CA  MET     1      55.489  16.069   3.613  1.00  0.00              
ATOM      3  CB  MET     1      55.199  16.623   5.019  1.00  0.00              
ATOM      4  C   MET     1      53.890  15.434   3.310  1.00  0.00              
ATOM      5  O   MET     1      52.902  15.782   3.971  1.00  0.00              
ATOM      6  CG  MET     1      56.062  17.833   5.341  1.00  0.00           C  
ATOM      7  SD  MET     1      55.937  18.517   7.006  1.00  0.00           S  
ATOM      8  CE  MET     1      56.886  17.217   7.874  1.00  0.00           C  
ATOM      9  N   ALA     2      53.854  14.445   2.424  1.00  0.00              
ATOM     10  CA  ALA     2      52.895  13.660   2.231  1.00  0.00              
ATOM     11  CB  ALA     2      53.134  12.918   0.924  1.00  0.00              
ATOM     12  C   ALA     2      52.253  12.986   3.391  1.00  0.00              
ATOM     13  O   ALA     2      51.034  12.834   3.347  1.00  0.00  
TER  

Rpdb adds the following column names (note: chainid, insert and segid have no values):

recname eleid elename alt resname chainid resid insert     x1      x2     x3 occ temp segid

Does anyone know a way to add the chain IDs? Thanks!


Solution

  • Using the "TER" records to mark where each protein chain ends, I was able to make something work for now. If anyone has a better, smoother, or faster way, please let me know:

    # Works for a PDB file with two chains
    pdb.input.table <- read.delim(file.choose(), sep = "", header = FALSE)
    
    # Row numbers of the "TER" records (first column), which close each chain
    ter.rows <- which(pdb.input.table[[1]] == "TER")
    
    # PDB chain splitting
    chainAstart <- 1
    chainAend <- ter.rows[1]
    chainBstart <- ter.rows[1] + 1
    chainBend <- ter.rows[2]
    
    # One label per row: "A" through the first TER, "B" through the second
    new.chain.id <- c(rep("A", chainAend), rep("B", chainBend - chainAend))
    
    pdb.dock.input <- cbind(pdb.input.table, new.chain.id)
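
The two-chain approach above could be generalized to any number of chains by counting the "TER" records seen so far. The following is a minimal sketch, not a tested drop-in solution. It assumes that every chain ends with a "TER" record and that the file has at most 26 chains. The helper name `assign_chain_ids` is hypothetical and is not part of Rpdb. It works on raw lines (as returned by `readLines()`) rather than a parsed table, so rows with differing numbers of fields are not a concern.

```r
# Hypothetical helper (not part of Rpdb): label each ATOM/HETATM line with a
# chain letter, starting a new chain after every TER record.
assign_chain_ids <- function(pdb_lines) {
  # Record type is the first six characters of a line, whitespace trimmed
  rec <- trimws(substr(pdb_lines, 1, 6))
  is_ter <- rec == "TER"
  # Number of TER records strictly before each line, plus one
  chain_index <- cumsum(c(0L, head(is_ter, -1))) + 1L
  is_atom <- rec %in% c("ATOM", "HETATM")
  # Assumption: chains are labelled A-Z, so more than 26 is an error here
  if (any(chain_index[is_atom] > length(LETTERS))) {
    stop("more than 26 chains; extend the label set")
  }
  data.frame(
    line = pdb_lines[is_atom],
    chainid = LETTERS[chain_index[is_atom]],
    stringsAsFactors = FALSE
  )
}

# A few lines in the same layout as the example data in the question
example_lines <- c(
  "REMARK  99  Chain ID : 1",
  "ATOM      1  N   MET     1      17.471 -55.657  42.605  1.00  0.00",
  "ATOM      2  CA  MET     1      17.516 -55.479  41.136  1.00  0.00",
  "TER",
  "REMARK  99  Chain ID : 1",
  "ATOM      1  N   MET     1      55.179  17.162   2.445  1.00  0.00",
  "TER"
)
labelled <- assign_chain_ids(example_lines)
table(labelled$chainid)
```

For a real file, `example_lines` would be replaced by something like `readLines(file.choose())`. Because the standard PDB layout keeps the chain identifier in column 22, the letters could then be written back into the text with `substr(labelled$line, 22, 22) <- labelled$chainid`, provided the file follows that fixed-column layout.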