Search code examples
pythonbioinformatics

Counting and Displaying Interactions between Residues from a Large File in Python or Bash


I have a large file containing data about interactions between residues in a protein. Each line in the file represents an interaction, with the first column indicating the type of interaction (e.g., sb, pc, vdw, hb), and the second and third columns indicating the residues involved in the interaction.

Here's an example of the data format:

sb ASP-11 LYS-15
sb GLU-309 HIS-46
sb ASP-11 LYS-15
sb GLU-103 HIS-296
sb ARG-290 GLU-72
sb GLU-103 HIS-296
sb GLU-86 LYS-90
sb ASP-113 LYS-117
sb ASP-279 LYS-15
sb ASP-61 HIS-296
sb ARG-114 ASP-113
sb ASP-113 LYS-117
sb GLU-10 LYS-14
sb GLU-309 HIS-46
sb ASP-279 LYS-15
pc HIS-46 TYR-45
pc ARG-156 TYR-158
pc HIS-182 TRP-153
pc LYS-242 PHE-50
pc ARG-156 TYR-158
pc HIS-275 LYS-282
pc LYS-311 TRP-304
pc HIS-182 TRP-153
pc ARG-114 PHE-50
pc ARG-114 PHE-50
ps PHE-41 TYR-45
vdw ASP-270 LEU-273
vdw ASP-270 LYS-272
vdw ASP-270 LYS-272
vdw HIS-275 LEU-277
vdw LEU-273 LEU-277
vdw GLU-276 LYS-272
vdw GLU-276 LYS-272

I need to process this file to count the number of each type of interaction, along with the residues involved, in the following format:

sb    pc    vdw    hb    residue1    residue2
2     0     0      0     ASP-11      LYS-15

For the 'hb' interaction, there are several types (e.g., hbbb, hbbs), but I want to count all of them together under 'hb'.

I'm looking for a Python or Bash script to achieve this. Specifically, the script should:

  1. Read the data from the file.
  2. Count the occurrences of each type of interaction.
  3. Display the counts and the residues involved in the interactions in the specified format.

I tried the following script, but it didn't work:

# Define a dictionary to store the counts of each interaction type
interaction_counts = {}

# Open the file and read line by line
with open("file1.txt", "r") as file:
    for line in file:
        # Split the line into interaction type and residues
        interaction, residue1, residue2 = line.strip().split()

        # Construct a unique key for each interaction based on residue pair
        interaction_key = "-".join(sorted([residue1, residue2]))

        # Update the counts for this interaction type
        interaction_counts[interaction_key] = interaction_counts.get(interaction_key, {})
        interaction_counts[interaction_key][interaction] = interaction_counts[interaction_key].get(interaction, 0) + 1

# Print the header
print("sb\tpc\tvdw\thb\tresidue1\tresidue2")

# Iterate over the interaction counts dictionary and print formatted output
for interaction_key, counts in interaction_counts.items():
    residue1, residue2 = interaction_key.split("-")
    sb_count = counts.get("sb", 0)
    pc_count = counts.get("pc", 0)
    vdw_count = counts.get("vdw", 0)
    hb_count = counts.get("hb", 0)
    print(f"{sb_count}\t{pc_count}\t{vdw_count}\t{hb_count}\t{residue1}\t{residue2}")

It gave the following error:

sb  pc  vdw hb  residue1    residue2
Traceback (most recent call last):
  File "/home/count_interactions.py", line 22, in <module>
    residue1, residue2 = interaction_key.split("-")
    ^^^^^^^^^^^^^^^^^^
ValueError: too many values to unpack (expected 2)

I'm open to any suggestions or solutions in Python or Bash. Thank you!


Solution

  • Your error stems from misparsing some line – I'd assume there's e.g. an empty line somewhere.

    This would do the trick for your example data:

    import collections
    
    interactions = collections.defaultdict(collections.Counter)
    
    with open("file1.txt", "r") as file:
        for line in file:
            line = line.strip().split()
            if len(line) == 3:
                pair = frozenset(line[1:])
                interactions[pair][line[0]] += 1
    
    interaction_kinds = ["sb", "pc", "vdw", "hb"]
    print(*interaction_kinds, "residue1", "residue2", sep="\t")
    for pair, counts in interactions.items():
        r1, r2 = sorted(pair)
        counts = [counts.get(x, 0) for x in interaction_kinds]
        print(*counts, r1, r2, sep="\t")