Search code examples
pythonpandasnumpyloopspython-itertools

Python script to sum values according to conditions in a loop


I need to sum the value contained in a column (column 9) if a condition is satisfied: the condition is that it needs to be a pair of individuals (column 1 and column 3), whether they are repeated or not.

My input file is made this way:

Sindhi_HGDP00171    0   Tunisian_39T    0   1   120437718   147097266   3.02    7.111
Sindhi_HGDP00183    1   Sindhi_HGDP00206    2   1   242708729   244766624   7.41    3.468
Sindhi_HGDP00183    1   Sindhi_HGDP00206    2   1   242708729   244766624   7.41    4.468
IBS_HG01768 2   Moroccan_MRA46  1   1   34186193    36027711    30.46   3.108
IBS_HG01710 1   Sardinian_HGDP01065 2   1   246117191   249120684   7.53    3.258
IBS_HG01768 2   Moroccan_MRA46  2   1   34186193    37320967    43.4    4.418

Therefore for instance, I would need the value of column 9 for each pair to be summed. Some of these pairs appear multiple time, in this case I would need the sum of value in column 9 betweem IBS_HG01768 and Moroccan_MRA46, and the sum of the value between Sindhi_HGDP00183 and Sindhi_HGDP00206. Some of these pairs are not repeated but I still need them to appear in the final results.

What I manage so far is to sum by group (population), so I sum column 9 value by pair of population like Sindhi and Tunisian for instance. I need to do the sum by pairs of Individuals.

My script is this:

import pandas as pd
import numpy as np
import itertools
 
# defines columns names
cols = ['ID1', 'HAP1', 'ID2', 'HAP2', 'CHR', 'STARTPOS', 'ENDPOS', 'LOD', 'IBDLENGTH']
 
# loads data (the file needs to be in the same folder where the script is)
data = pd.read_csv("./Roma_Ref_All_sorted.txt", sep = '\t', names = cols)

# removes the sample ID for ID1/ID2 columns and places it in two dedicated columns
data[['ID1', 'ID1_samples']] = data['ID1'].str.split('_', expand = True)
data[['ID2', 'ID2_samples']] = data['ID2'].str.split('_', expand = True)

# gets the groups list from both ID columns...
groups_id1 = list(data.ID1.unique()) 
groups_id2 = list(data.ID2.unique())
groups = list(set(groups_id1 + groups_id2))
# ... and all the possible pairs 
group_pairs = [i for i in itertools.combinations(groups, 2)]
# subsets the pairs having Roma
group_pairs_roma = [x for x in group_pairs if ('Roma' in x[0] and x[0] != 'Romanian') or 
                    ('Roma' in x[1] and x[1] != 'Romanian')]

# preapres output df
result = pd.DataFrame(columns = ['ID1', 'ID2', 'IBD_sum'])
 
# loops all the possible pairs and computes the sum of IBD length
for idx, group_pair in enumerate(group_pairs_roma):
    id1 = group_pair[0]
    id2 = group_pair[1]
    ibd_sum = round(data.loc[((data['ID1'] == id1) & (data['ID2'] == id2)) |
                              ((data['ID1'] == id2) & (data['ID2'] == id1)), 'IBDLENGTH'].sum(),3)
    result.loc [idx, ['ID1', 'ID2', 'IBD_sum']] = [id1, id2, ibd_sum]
 
# saves results
result.to_csv("./groups_pairs_sum_IBD.txt", sep = '\t', index = False)

My current output is something like this:

ID1 ID2 IBD_sum
Sindhi  IBS 3.275
Sindhi  Moroccan    74.201
Sindhi  Sindhi  119.359

While I need something like:

ID1 ID2 IBD_sum
Sindhi_individual1 Moroccan_individual1 3.275
Sindhi_individual2 Moroccan_individual2 5.275
Sindhi_individual3  IBS_individual1 4.275

I have tried by substituting one line in my code, by writing

groups_id1 = list(data.ID1_samples.unique()) 
groups_id2 = list(data.ID2_samples.unique())

and later

ibd_sum = round(data.loc[((data['ID1_samples'] == id1) & (data['ID2_samples'] == id2)) |
                              ((data['ID1_samples'] == id2) & (data['ID2_samples'] == id1)), 'IBDLENGTH'].sum(),3)

Which in theory should work because I set the individuals as pairs instead of populations as pairs, but the output was empty. What could I do to edit the code for what I need?


Solution

  • I have solved the problem on my own but using R language. This is the code:

    ibd <- read.delim("input.txt", sep='\t')
    
    ibd_sum_indv <- ibd %>%
      group_by(ID1, ID2) %>%
      summarise(SIBD = sum(IBDLENGTH),
                NIBD = n()) %>% 
      ungroup()