In the following data:
M1 M2 M3 M4 M5 M6 M7 M8 Hx Hy S1 S2 S3 S4
A T T A A G A C A C C G C T
A T T A A G A C A C C G C T
T G C T G T T G T A A T A T
C A A C A G T C C G G A C G
G T G T A T C T G T C T T T
the following code was used:
d1 = d1.add('g').add(d1.shift()).dropna()
to get:
M1 M2 M3 M4 M5 M6 M7 M8 Hx Hy S1 S2 S3 S4
AgA TgT TgT AgA AgA GgG AgA CgC AgA CgC CgC GgG CgC TgT
TgA GgT CgT TgA GgA TgG TgA GgC TgA AgC AgC TgG AgC TgT
CgT AgG AgC CgT AgG GgT TgT CgG CgT GgA GgA AgT CgA GgT
GgC TgA GgA TgC AgA TgG CgT TgC GgC TgG CgG TgA TgC TgG
But if the data has the following structure:
M1 M2 M3 M4 Hx Hy S1 S2 pos
A/T T/A A/G G/G A C C/G C/T 2
A/T T/A A/G G/G G T C/G C/T 12
T/G C/T G/T T/G C G T/T T/T 16
T/T T/T T/T T|T G T T/T T/T 17
Instead, I want all possible combinations of letters (between the previous and the current line) for each column except pos.
So, it would be like:
M1 M2 Hx Hy S1 S2
AgA,AgT,TgA,TgT TgT,TgA,AgT,AgA AgA TgC CgC,CgG,GgC,GgG CgC,CgT,TgC,TgT
TgA,TgT,GgA,GgT ....
and so on for all the other lines.
I am adding a matrix to illustrate the process for M1:
                              values from previous line (pos 12)
                              A      T
values from next line    T    TgA    TgT
(pos 16)                 G    GgA    GgT
I tried to use itertools, starting by collecting the values of each row into a list:
temp = []
for index, data in d1_group.iterrows():
    # data is the row as a Series; store its values as a plain list
    temp.append(data.tolist())
print(temp)
My next thought is to use the index (or pos) as keys and then create combinations between the values at adjacent indices (or pos).
Is it possible to do this using pandas or a dictionary?
Thanks,
Preamble:
import itertools as it
list(it.product(['A'], ['T']))
Out[229]: [('A', 'T')]
list(it.product(['A', 'T'], ['T', 'G']))
Out[230]: [('A', 'T'), ('A', 'G'), ('T', 'T'), ('T', 'G')]
','.join('g'.join(t) for t in it.product(['A'], ['T']))
Out[231]: 'AgT'
','.join('g'.join(t) for t in it.product(['T', 'G'],['A', 'T']))
Out[233]: 'TgA,TgT,GgA,GgT'
So let's build a dataframe in which every cell is wrapped in a nested list, ready to be passed to it.product:
df=df.applymap(lambda c: [[c]])
df
Out[258]:
M1 M2 M3 M4 M5 M6 M7 M8 Hx Hy \
0 [[A]] [[T]] [[T]] [[A]] [[A]] [[G]] [[A]] [[C]] [[A]] [[C]]
1 [[A]] [[T]] [[T]] [[A]] [[A]] [[G]] [[A]] [[C]] [[A]] [[C]]
2 [[T]] [[G]] [[C]] [[T]] [[G]] [[T]] [[T]] [[G]] [[T]] [[A]]
3 [[C]] [[A]] [[A]] [[C]] [[A]] [[G]] [[T]] [[C]] [[C]] [[G]]
4 [[G]] [[T]] [[G]] [[T]] [[A]] [[T]] [[C]] [[T]] [[G]] [[T]]
(df+df.shift(1)).dropna(how='all').applymap(lambda c: ','.join('g'.join(t)
for t in it.product(*c)))
Out[266]:
M1 M2 M3 M4 M5 M6 M7 M8 Hx Hy S1 S2 S3 S4
1 AgA TgT TgT AgA AgA GgG AgA CgC AgA CgC CgC GgG CgC TgT
2 TgA GgT CgT TgA GgA TgG TgA GgC TgA AgC AgC TgG AgC TgT
3 CgT AgG AgC CgT AgG GgT TgT CgG CgT GgA GgA AgT CgA GgT
4 GgC TgA GgA TgC AgA TgG CgT TgC GgC TgG CgG TgA TgC TgG
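To see why adding the frame to its shifted copy works, it may help to look at a single cell in plain Python. The sketch below uses hypothetical variable names (`current`, `previous`, `cell`) and assumes each cell holds a list wrapped in an outer list, as built above: `+` concatenates the two outer lists, and `it.product(*cell)` then pairs every current value with every previous value.

```python
import itertools as it

# One cell from the current row and the matching cell from the previous row
# (the previous-row value is what df.shift(1) lines up against the current row).
current = [["T"]]
previous = [["A"]]

# '+' on lists is concatenation, so the cell now holds both inner lists.
cell = current + previous  # [['T'], ['A']]

# Unpacking the cell gives product() one iterable per row.
single = ",".join("g".join(t) for t in it.product(*cell))
print(single)  # TgA

# The same mechanism with two letters per cell.
cell = [["T", "G"]] + [["A", "T"]]
multi = ",".join("g".join(t) for t in it.product(*cell))
print(multi)  # TgA,TgT,GgA,GgT
```

Because the current row comes first in the concatenation, each result reads as current letter, `g`, previous letter.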
Now the same for the pairs, with just a bit more cleanup and preparation:
df.set_index('pos', inplace=True)
df
Out[273]:
M1 M2 M3 M4 Hx Hy S1 S2
pos
2 A/T T/A A/G G/G A C C/G C/T
12 A/T T/A A/G G/G G T C/G C/T
16 T/G C/T G/T T/G C G T/T T/T
17 T/T T/T T/T T|T G T T/T T/T
df = df.applymap(lambda c: [c.split('/')])
df
Out[274]:
M1 M2 M3 M4 Hx Hy S1 S2
pos
2 [[A, T]] [[T, A]] [[A, G]] [[G, G]] [[A]] [[C]] [[C, G]] [[C, T]]
12 [[A, T]] [[T, A]] [[A, G]] [[G, G]] [[G]] [[T]] [[C, G]] [[C, T]]
16 [[T, G]] [[C, T]] [[G, T]] [[T, G]] [[C]] [[G]] [[T, T]] [[T, T]]
17 [[T, T]] [[T, T]] [[T, T]] [[T|T]] [[G]] [[T]] [[T, T]] [[T, T]]
(df+df.shift(1)).dropna(how='all').applymap(lambda c: ','.join('g'.join(t) for t in it.product(*c)))
Out[276]:
M1 M2 M3 M4 Hx \
pos
12 AgA,AgT,TgA,TgT TgT,TgA,AgT,AgA AgA,AgG,GgA,GgG GgG,GgG,GgG,GgG GgA
16 TgA,TgT,GgA,GgT CgT,CgA,TgT,TgA GgA,GgG,TgA,TgG TgG,TgG,GgG,GgG CgG
17 TgT,TgG,TgT,TgG TgC,TgT,TgC,TgT TgG,TgT,TgG,TgT T|TgT,T|TgG GgC
Hy S1 S2
pos
12 TgC CgC,CgG,GgC,GgG CgC,CgT,TgC,TgT
16 GgT TgC,TgG,TgC,TgG TgC,TgT,TgC,TgT
17 TgG TgT,TgT,TgT,TgT TgT,TgT,TgT,TgT
You can now reset the index to get pos back as a column. You might need to adjust it by shifting it so that it aligns with the rows you want.
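Note that the `T|T` cell in M4 is not split by `split('/')`, which is why it shows up as `T|TgT,T|TgG` in the output. If `|` is meant to separate letters just like `/` (an assumption about the data, not something the question states), a variant along these lines might work. It is only a sketch on three of the columns: the helper names `split_letters` and `pair_with_previous` are made up for illustration, and it uses plain list comprehensions instead of `applymap`, which pandas 2.1 and later deprecate in favor of `DataFrame.map`.

```python
import itertools as it
import re

import pandas as pd

# Subset of the second table from the question (M1, M4, Hx), indexed by pos.
df = pd.DataFrame({
    "M1": ["A/T", "A/T", "T/G", "T/T"],
    "M4": ["G/G", "G/G", "T/G", "T|T"],
    "Hx": ["A", "G", "C", "G"],
    "pos": [2, 12, 16, 17],
}).set_index("pos")


def split_letters(cell):
    # Assumption: '|' is treated as a separator exactly like '/'.
    return re.split(r"[/|]", cell)


def pair_with_previous(col):
    """Combine each row's letters with those of the row before it."""
    values = [split_letters(c) for c in col]
    joined = [
        ",".join("g".join(t) for t in it.product(cur, prev))
        for prev, cur in zip(values, values[1:])
    ]
    # The first row has no predecessor, so the result starts at the second pos.
    return pd.Series(joined, index=col.index[1:])


result = pd.DataFrame({name: pair_with_previous(df[name]) for name in df.columns})

# Bring pos back as an ordinary column.
result = result.rename_axis("pos").reset_index()
print(result)
```

Each output row keeps the pos of the current line, so the row for pos 17 combines the lines at 17 and 16.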