Search code examples
mdanalysis

How to calculate center of mass of proteins using MDAnalysis?


I'm in a bit unusual situation. There are seven different proteins stored in a single file according to their residues names. Each protein has different sequence length. Now I need to calculate the center of mass of each protein and generate a time series data.I know how to do with a single protein, but do not with multiple protein system. For single protein I can do something like this:

import MDAnalysis as mda
import numpy as np

u = mda.Universe('lp400start.gro')
u1 = mda.Merge(u.select_atoms("not resname W and not resname WF and not resname ION"))
u1.load_new('lp400.xtc')
protein = u1.select_atoms("protein")
arr = np.empty((protein.n_residues, u1.trajectory.n_frames, 3))

for ts in u.trajectory:
    arr[:, ts.frame] = protein.center_of_mass(compound='residues')

the data files are publicly available here. The residue sequence in the topology file can be checked using grep "^ *1[A-Z]" -B1 lp400final.gro | grep -v "^ *1[A-Z]", the output:

 38ALA     BB   74  52.901  33.911   6.318
--
   38ALA     BB  148  41.842  29.381   7.211
--
  137GLY     BB  455  36.756   4.287   3.284
--
  137GLY     BB  762  44.721  60.377   3.112
--
  252HIS    SC3 1368  28.682  37.936   6.727
--
  252HIS    SC3 1974  18.533  46.506   6.314
--
  576PHE    SC3 3263  48.937  38.538   4.013
--
  576PHE    SC3 4552  18.513  25.948   3.800
--
 1092PRO    SC1 6470  42.510  40.992   6.775
--
 1092PRO    SC1 8388  14.709   4.759   6.370
--
 1016LEU    SC110524  57.264  56.308   2.632
--
 1016LEU    SC112660  50.716  14.698   2.728
--
 1285LYS    SC215345   0.793  33.529   1.509

First protein has sequence length of 38 residues and it has a copy of its own and then the second protein and so on. Now I want to have the COM of each protein at each time frame and build it into a time series. In addition to proteins topology file also contains the DPPC particles as well. Could someone help me how to do this? Thanks!

To make sure the output trajectory is correct it looks something like this enter link description here


Solution

  • I would load the system from the TPR file to maintain the bond information. Then MDAnalysis can determine fragments (namely, your proteins). Then loop over the fragments to determine the COM time series:

    import MDAnalysis as mda
    import numpy as np
    
    # files from https://doi.org/10.5281/zenodo.846428
    TPR = "lp400.tpr"
    XTC = "lp400.xtc"
    
    # build reduced universe to match XTC
    # (ignore warnings that no coordinates are found for the TPR)
    u0 = mda.Universe(TPR)
    u = mda.Merge(u0.select_atoms("not resname W and not resname WF and not resname ION"))
    u.load_new(XTC)
    
    # segments (exclude the last one, which is DPPC and not protein)
    protein_segments = u.segments[:-1]
    
    # build the fragments
    # (a dictionary with the key as the protein name -- I am using an
    # OrderedDict so that the order is the same as in the TPR)
    from collections import OrderedDict
    protein_fragments = OrderedDict((seg.segid[6:], seg.atoms.fragments) for seg in protein_segments)
    
    # analyze trajectory (with a nice progress bar)
    timeseries = []
    for ts in mda.log.ProgressBar(u.trajectory):
        coms = []
        for name, proteins in protein_fragments.items():
            # loop over all the different proteins;
            # unwrap to get the true COM under PBC (double check!!)
            coms.extend([p.center_of_mass(unwrap=True) for p in proteins]) 
        timeseries.append(coms)
    timeseries = np.array(timeseries)
    

    Note

    • Double check that unwrap=True is doing the right thing (and that it is necessary — I didn't check if any proteins were split across periodic boundaries).
    • Unwrapping is slow; if you don't need it, it will run faster.
    • The resulting array is a 3d array with shape (N_timesteps, M_proteins, 3), namely (10001, 14, 3).
    • The content of protein_fragments is
      OrderedDict([('EPHA', (<AtomGroup with 74 atoms>, <AtomGroup with 74 atoms>)),
                 ('OMPA', (<AtomGroup with 307 atoms>, <AtomGroup with 307 atoms>)),
                 ('OMPG', (<AtomGroup with 606 atoms>, <AtomGroup with 606 atoms>)),
                 ('BTUB', (<AtomGroup with 1289 atoms>, <AtomGroup with 1289 atoms>)),
                 ('ATPS', (<AtomGroup with 1918 atoms>, <AtomGroup with 1918 atoms>)),
                 ('GLPF', (<AtomGroup with 2136 atoms>, <AtomGroup with 2136 atoms>)),
                 ('FOCA', (<AtomGroup with 2685 atoms>, <AtomGroup with 2685 atoms>))])