python, python-3.x, bioinformatics, biopython

Is there a way to separate the chains belonging to each biological assembly in a PDB file? (Python script)


I want to separate the chain IDs that belong to specific biological assemblies in a PDB file. For example, PDB ID 1BRS has 3 biological assemblies:

Biological assembly 1: chains A and D
Biological assembly 2: chains B and E
Biological assembly 3: chains C and F

Is there a way (a Python script) to get the chain IDs that belong to each biological assembly separately, as follows?

1BRS_A:D
1BRS_B:E
1BRS_C:F

There is no need to extract the chain coordinates; the chain names are enough. Thanks in advance.


Solution

  • The PDBx/mmCIF file format stores this information in the pdbx_struct_assembly_gen category. For 1BRS it looks like this:

    loop_
    _pdbx_struct_assembly_gen.assembly_id 
    _pdbx_struct_assembly_gen.oper_expression 
    _pdbx_struct_assembly_gen.asym_id_list 
    1 1 A,D,G,J 
    2 1 B,E,H,K 
    3 1 C,F,I,L 
    

    These files can be read, for example, with Biotite (https://www.biotite-python.org/), a package I am developing. Each category can be accessed in a dictionary-like manner:

    import biotite.database.rcsb as rcsb
    import biotite.structure.io.pdbx as pdbx
    
    ID = "1BRS"
    
    # Download structure
    file_name = rcsb.fetch(ID, "pdbx", target_path=".")
    
    # Read file
    file = pdbx.PDBxFile()
    file.read(file_name)
    # Get 'pdbx_struct_assembly_gen' category as dictionary
    assembly_dict = file["pdbx_struct_assembly_gen"]
    for asym_id_list in assembly_dict["asym_id_list"]:
        chain_ids = asym_id_list.split(",")
        print(f"{ID}_{':'.join(chain_ids)}")
    

    The output is

    1BRS_A:D:G:J
    1BRS_B:E:H:K
    1BRS_C:F:I:L
    

    The chains G-L contain only water molecules.
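
For a quick look at how the category is laid out, the simple loop_ block shown above can also be parsed by hand with the standard library only. This is a minimal sketch, not a general mmCIF parser: the function name `parse_simple_loop` is hypothetical, and it assumes whitespace-separated values without quoting or multi-line fields, which real files may contain (a dedicated reader such as Biotite handles those cases).

```python
def parse_simple_loop(text):
    """Parse one whitespace-separated mmCIF 'loop_' block into a dict of columns.

    Assumption: no quoted or multi-line values appear in the block.
    """
    columns = {}
    names = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "loop_":
            continue
        if line.startswith("_"):
            # Header line such as '_pdbx_struct_assembly_gen.asym_id_list'
            name = line.split(".", 1)[1]
            names.append(name)
            columns[name] = []
        else:
            # Data row: one value per declared column, in header order
            for name, value in zip(names, line.split()):
                columns[name].append(value)
    return columns


# The excerpt from the 1BRS file shown above
EXCERPT = """
loop_
_pdbx_struct_assembly_gen.assembly_id
_pdbx_struct_assembly_gen.oper_expression
_pdbx_struct_assembly_gen.asym_id_list
1 1 A,D,G,J
2 1 B,E,H,K
3 1 C,F,I,L
"""

table = parse_simple_loop(EXCERPT)
for assembly_id, asym_id_list in zip(table["assembly_id"], table["asym_id_list"]):
    print(assembly_id, asym_id_list.split(","))
```

Each column becomes a list of strings in row order, which matches the dictionary-like access used in the Biotite example.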

    EDIT:

    To include only chain IDs that belong to a polymer, e.g. a protein or a nucleic acid, you can use the entity_poly category:

    loop_
    _entity_poly.entity_id 
    _entity_poly.type 
    _entity_poly.nstd_linkage 
    _entity_poly.nstd_monomer 
    _entity_poly.pdbx_seq_one_letter_code 
    _entity_poly.pdbx_seq_one_letter_code_can 
    _entity_poly.pdbx_strand_id 
    _entity_poly.pdbx_target_identifier 
    1 'polypeptide(L)' no no 
    ;AQVINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTS
    GFRNSDRILYSSDWLIYKTTDHYQTFTKIR
    ;
    ;AQVINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTS
    GFRNSDRILYSSDWLIYKTTDHYQTFTKIR
    ;
    A,B,C ? 
    2 'polypeptide(L)' no no 
    ;KKAVINGEQIRSISDLHQTLKKELALPEYYGENLDALWDALTGWVEYPLVLEWRQFEQSKQLTENGAESVLQVFREAKAE
    GADITIILS
    ;
    ;KKAVINGEQIRSISDLHQTLKKELALPEYYGENLDALWDALTGWVEYPLVLEWRQFEQSKQLTENGAESVLQVFREAKAE
    GADITIILS
    ;
    D,E,F ? 
    

    This is the updated Python code:

    import biotite.database.rcsb as rcsb
    import biotite.structure.io.pdbx as pdbx
    
    ID = "1BRS"
    
    # Download structure
    file_name = rcsb.fetch(ID, "pdbx", target_path=".")
    
    # Read file
    file = pdbx.PDBxFile()
    file.read(file_name)
    
    # Get 'entity_poly' category as dictionary
    # to find out which chains are polymers
    poly_chains = []
    for chain_list in file["entity_poly"]["pdbx_strand_id"]:
        poly_chains += chain_list.split(",")
    
    # Get 'pdbx_struct_assembly_gen' category as dictionary
    for asym_id_list in file["pdbx_struct_assembly_gen"]["asym_id_list"]:
        chain_ids = asym_id_list.split(",")
        # Filter chains that belong to a polymer
        chain_ids = [chain_id for chain_id in chain_ids if chain_id in poly_chains]
        print(f"{ID}_{':'.join(chain_ids)}")
    

    And this is the output:

    1BRS_A:D
    1BRS_B:E
    1BRS_C:F
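
Two details may matter for other entries. An assembly can span several rows of pdbx_struct_assembly_gen, so printing one line per row can split an assembly. Also, asym_id_list holds label chain IDs while pdbx_strand_id holds author chain IDs; the two coincide for 1BRS but need not in general. The following is a minimal sketch of a more defensive variant that groups rows by assembly_id and decides what is a polymer through the struct_asym category (which maps each label chain ID to its entity). The function name is hypothetical, and the literal lists are stand-ins for the columns read from the file; the struct_asym values in particular are an assumption based on the entity layout described above (entities 1 and 2 are the proteins, chains G-L are water).

```python
def polymer_chains_per_assembly(assembly_ids, asym_id_lists,
                                asym_ids, asym_entity_ids,
                                polymer_entity_ids):
    """Group polymer chain IDs (label IDs) by assembly ID.

    assembly_ids, asym_id_lists: columns of 'pdbx_struct_assembly_gen'
    asym_ids, asym_entity_ids:   'id' and 'entity_id' columns of 'struct_asym'
    polymer_entity_ids:          'entity_id' column of 'entity_poly'
    """
    polymer_entities = set(polymer_entity_ids)
    # Label chain IDs whose entity is listed in 'entity_poly'
    polymer_asym = {
        asym for asym, entity in zip(asym_ids, asym_entity_ids)
        if entity in polymer_entities
    }
    chains = {}
    for assembly_id, asym_id_list in zip(assembly_ids, asym_id_lists):
        # Rows with the same assembly ID are merged into one list
        members = chains.setdefault(assembly_id, [])
        for chain_id in asym_id_list.split(","):
            if chain_id in polymer_asym and chain_id not in members:
                members.append(chain_id)
    return chains


ID = "1BRS"

# Stand-in values mirroring the 1BRS excerpts (assumed, not read from a file)
result = polymer_chains_per_assembly(
    assembly_ids=["1", "2", "3"],
    asym_id_lists=["A,D,G,J", "B,E,H,K", "C,F,I,L"],
    asym_ids=list("ABCDEFGHIJKL"),
    asym_entity_ids=["1"] * 3 + ["2"] * 3 + ["3"] * 6,
    polymer_entity_ids=["1", "2"],
)
for chain_ids in result.values():
    print(f"{ID}_{':'.join(chain_ids)}")
```

With these stand-in values the printed lines match the output above. When reading a real file, the five arguments would come from the corresponding category columns; recent Biotite releases read mmCIF through `pdbx.CIFFile` rather than `pdbx.PDBxFile`, so the file-reading lines may need adapting to the installed version.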