I need to sum the value contained in a column (column 9) if a condition is satisfied: the condition is that it needs to be a pair of individuals (column 1 and column 3), whether they are repeated or not.
My input file is made this way:
Sindhi_HGDP00171 0 Tunisian_39T 0 1 120437718 147097266 3.02 7.111
Sindhi_HGDP00183 1 Sindhi_HGDP00206 2 1 242708729 244766624 7.41 3.468
Sindhi_HGDP00183 1 Sindhi_HGDP00206 2 1 242708729 244766624 7.41 4.468
IBS_HG01768 2 Moroccan_MRA46 1 1 34186193 36027711 30.46 3.108
IBS_HG01710 1 Sardinian_HGDP01065 2 1 246117191 249120684 7.53 3.258
IBS_HG01768 2 Moroccan_MRA46 2 1 34186193 37320967 43.4 4.418
Therefore for instance, I would need the value of column 9 for each pair to be summed. Some of these pairs appear multiple time, in this case I would need the sum of value in column 9 betweem IBS_HG01768 and Moroccan_MRA46, and the sum of the value between Sindhi_HGDP00183 and Sindhi_HGDP00206. Some of these pairs are not repeated but I still need them to appear in the final results.
What I manage so far is to sum by group (population), so I sum column 9 value by pair of population like Sindhi and Tunisian for instance. I need to do the sum by pairs of Individuals.
My script is this:
import pandas as pd
import numpy as np
import itertools
# defines columns names
cols = ['ID1', 'HAP1', 'ID2', 'HAP2', 'CHR', 'STARTPOS', 'ENDPOS', 'LOD', 'IBDLENGTH']
# loads data (the file needs to be in the same folder where the script is)
data = pd.read_csv("./Roma_Ref_All_sorted.txt", sep = '\t', names = cols)
# removes the sample ID for ID1/ID2 columns and places it in two dedicated columns
data[['ID1', 'ID1_samples']] = data['ID1'].str.split('_', expand = True)
data[['ID2', 'ID2_samples']] = data['ID2'].str.split('_', expand = True)
# gets the groups list from both ID columns...
groups_id1 = list(data.ID1.unique())
groups_id2 = list(data.ID2.unique())
groups = list(set(groups_id1 + groups_id2))
# ... and all the possible pairs
group_pairs = [i for i in itertools.combinations(groups, 2)]
# subsets the pairs having Roma
group_pairs_roma = [x for x in group_pairs if ('Roma' in x[0] and x[0] != 'Romanian') or
('Roma' in x[1] and x[1] != 'Romanian')]
# preapres output df
result = pd.DataFrame(columns = ['ID1', 'ID2', 'IBD_sum'])
# loops all the possible pairs and computes the sum of IBD length
for idx, group_pair in enumerate(group_pairs_roma):
id1 = group_pair[0]
id2 = group_pair[1]
ibd_sum = round(data.loc[((data['ID1'] == id1) & (data['ID2'] == id2)) |
((data['ID1'] == id2) & (data['ID2'] == id1)), 'IBDLENGTH'].sum(),3)
result.loc [idx, ['ID1', 'ID2', 'IBD_sum']] = [id1, id2, ibd_sum]
# saves results
result.to_csv("./groups_pairs_sum_IBD.txt", sep = '\t', index = False)
My current output is something like this:
ID1 ID2 IBD_sum
Sindhi IBS 3.275
Sindhi Moroccan 74.201
Sindhi Sindhi 119.359
While I need something like:
ID1 ID2 IBD_sum
Sindhi_individual1 Moroccan_individual1 3.275
Sindhi_individual2 Moroccan_individual2 5.275
Sindhi_individual3 IBS_individual1 4.275
I have tried by substituting one line in my code, by writing
groups_id1 = list(data.ID1_samples.unique())
groups_id2 = list(data.ID2_samples.unique())
and later
ibd_sum = round(data.loc[((data['ID1_samples'] == id1) & (data['ID2_samples'] == id2)) |
((data['ID1_samples'] == id2) & (data['ID2_samples'] == id1)), 'IBDLENGTH'].sum(),3)
Which in theory should work because I set the individuals as pairs instead of populations as pairs, but the output was empty. What could I do to edit the code for what I need?
I have solved the problem on my own but using R language. This is the code:
ibd <- read.delim("input.txt", sep='\t')
ibd_sum_indv <- ibd %>%
group_by(ID1, ID2) %>%
summarise(SIBD = sum(IBDLENGTH),
NIBD = n()) %>%
ungroup()