python, pandas, numpy, dictionary, python-itertools

How to read two consecutive lines of the same column to create combinations of the values from that column?


In the following data:

M1  M2  M3  M4  M5  M6  M7  M8  Hx Hy    S1    S2    S3    S4
A   T   T   A   A   G   A   C   A   C    C     G     C     T
A   T   T   A   A   G   A   C   A   C    C     G     C     T
T   G   C   T   G   T   T   G   T   A    A     T     A     T
C   A   A   C   A   G   T   C   C   G    G     A     C     G
G   T   G   T   A   T   C   T   G   T    C     T     T     T

the following code was used:

d1 = d1.add('g').add(d1.shift()).dropna()

to get:

M1   M2   M3   M4   M5   M6   M7   M8   Hx   Hy   S1   S2   S3   S4
AgA  TgT  TgT  AgA  AgA  GgG  AgA  CgC  AgA  CgC  CgC  GgG  CgC  TgT   
TgA  GgT  CgT  TgA  GgA  TgG  TgA  GgC  TgA  AgC  AgC  TgG  AgC  TgT   
CgT  AgG  AgC  CgT  AgG  GgT  TgT  CgG  CgT  GgA  GgA  AgT  CgA  GgT   
GgC  TgA  GgA  TgC  AgA  TgG  CgT  TgC  GgC  TgG  CgG  TgA  TgC  TgG 

But if the data has the following structure:

M1   M2   M3   M4   Hx  Hy  S1   S2   pos
A/T  T/A  A/G  G/G  A   C   C/G  C/T  2
A/T  T/A  A/G  G/G  G   T   C/G  C/T  12
T/G  C/T  G/T  T/G  C   G   T/T  T/T  16
T/T  T/T  T/T  T|T  G   T   T/T  T/T  17

Instead, I want all possible combinations of letters (between the previous and the current line) for each column except pos.

So, it would be like:

M1               M2               Hx   Hy   S1               S2
AgA,AgT,TgA,TgT  TgT,TgA,AgT,AgA  GgA  TgC  CgC,CgG,GgC,GgG  CgC,CgT,TgC,TgT
TgA,TgT,GgA,GgT  ....
and so on for all other lines.

Here is a matrix to illustrate the process:

                            values from previous line in M1 (at pos 12)
                                  A       T
values from next line       T     TgA     TgT
(pos 16) ->                 G     GgA     GgT

I tried iterating over the rows to collect the values of each row as a list:

temp = []  # one list of cell values per row
for index, data in d1_group.iterrows():
    temp.append(data.tolist())
print(temp)

My next thought is to use the index (or pos) as keys and then create combinations between the values at adjacent index (or pos) entries.

Is there a way to do this using pandas or a dictionary?

Thanks,


Solution

  • Preamble:

    import itertools as it
    
    list(it.product(['A'], ['T']))
    Out[229]: [('A', 'T')]
    
    list(it.product(['A', 'T'], ['T', 'G']))
    Out[230]: [('A', 'T'), ('A', 'G'), ('T', 'T'), ('T', 'G')]
    
    ','.join('g'.join(t) for t in it.product(['A'], ['T']))
    Out[231]: 'AgT'
    
    ','.join('g'.join(t) for t in it.product(['T', 'G'],['A', 'T']))
    Out[233]: 'TgA,TgT,GgA,GgT'
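
The one-liners above can be wrapped in a small helper. This is only a minimal sketch: the name `combos` and its `sep` parameter are illustrative and not part of the original answer.

```python
import itertools as it


def combos(current, previous, sep="g"):
    """Return every current/previous pairing, joined by `sep`, as one string."""
    # it.product yields (current_value, previous_value) tuples, varying the
    # second argument fastest.
    return ",".join(sep.join(pair) for pair in it.product(current, previous))


print(combos(["A"], ["T"]))            # single value on both lines
print(combos(["T", "G"], ["A", "T"]))  # two values on both lines
```

The argument order matters: the first list supplies the letter before `g`, the second the letter after it.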
    

    So let's build a dataframe whose cells are nested lists, so that adding two frames concatenates them into the arguments for it.product:

    df=df.applymap(lambda c: [[c]])
    
    df
    Out[258]: 
          M1     M2     M3     M4     M5     M6     M7     M8     Hx     Hy  \
    0  [[A]]  [[T]]  [[T]]  [[A]]  [[A]]  [[G]]  [[A]]  [[C]]  [[A]]  [[C]]   
    1  [[A]]  [[T]]  [[T]]  [[A]]  [[A]]  [[G]]  [[A]]  [[C]]  [[A]]  [[C]]   
    2  [[T]]  [[G]]  [[C]]  [[T]]  [[G]]  [[T]]  [[T]]  [[G]]  [[T]]  [[A]]   
    3  [[C]]  [[A]]  [[A]]  [[C]]  [[A]]  [[G]]  [[T]]  [[C]]  [[C]]  [[G]]   
    4  [[G]]  [[T]]  [[G]]  [[T]]  [[A]]  [[T]]  [[C]]  [[T]]  [[G]]  [[T]]  
    
    (df+df.shift(1)).dropna(how='all').applymap(lambda c: ','.join('g'.join(t)
                                                          for t in it.product(*c)))
    Out[266]: 
        M1   M2   M3   M4   M5   M6   M7   M8   Hx   Hy   S1   S2   S3   S4
    1  AgA  TgT  TgT  AgA  AgA  GgG  AgA  CgC  AgA  CgC  CgC  GgG  CgC  TgT
    2  TgA  GgT  CgT  TgA  GgA  TgG  TgA  GgC  TgA  AgC  AgC  TgG  AgC  TgT
    3  CgT  AgG  AgC  CgT  AgG  GgT  TgT  CgG  CgT  GgA  GgA  AgT  CgA  GgT
    4  GgC  TgA  GgA  TgC  AgA  TgG  CgT  TgC  GgC  TgG  CgG  TgA  TgC  TgG
    

    Now the same for the pairs (e.g. A/T), with just a bit more cleanup/preparation:

    df.set_index('pos', inplace=True)
    
    df
    Out[273]: 
          M1   M2   M3   M4 Hx Hy   S1   S2
    pos                                    
    2    A/T  T/A  A/G  G/G  A  C  C/G  C/T
    12   A/T  T/A  A/G  G/G  G  T  C/G  C/T
    16   T/G  C/T  G/T  T/G  C  G  T/T  T/T
    17   T/T  T/T  T/T  T|T  G  T  T/T  T/T
    
    df = df.applymap(lambda c: [c.split('/')])
    df
    Out[274]: 
               M1        M2        M3        M4     Hx     Hy        S1        S2
    pos                                                                          
    2    [[A, T]]  [[T, A]]  [[A, G]]  [[G, G]]  [[A]]  [[C]]  [[C, G]]  [[C, T]]
    12   [[A, T]]  [[T, A]]  [[A, G]]  [[G, G]]  [[G]]  [[T]]  [[C, G]]  [[C, T]]
    16   [[T, G]]  [[C, T]]  [[G, T]]  [[T, G]]  [[C]]  [[G]]  [[T, T]]  [[T, T]]
    17   [[T, T]]  [[T, T]]  [[T, T]]   [[T|T]]  [[G]]  [[T]]  [[T, T]]  [[T, T]]
    
    
    
    (df+df.shift(1)).dropna(how='all').applymap(lambda c: ','.join('g'.join(t) for t in it.product(*c)))
    Out[276]: 
                      M1               M2               M3               M4   Hx  \
    pos                                                                            
    12   AgA,AgT,TgA,TgT  TgT,TgA,AgT,AgA  AgA,AgG,GgA,GgG  GgG,GgG,GgG,GgG  GgA   
    16   TgA,TgT,GgA,GgT  CgT,CgA,TgT,TgA  GgA,GgG,TgA,TgG  TgG,TgG,GgG,GgG  CgG   
    17   TgT,TgG,TgT,TgG  TgC,TgT,TgC,TgT  TgG,TgT,TgG,TgT      T|TgT,T|TgG  GgC   
    
          Hy               S1               S2  
    pos                                         
    12   TgC  CgC,CgG,GgC,GgG  CgC,CgT,TgC,TgT  
    16   GgT  TgC,TgG,TgC,TgG  TgC,TgT,TgC,TgT  
    17   TgG  TgT,TgT,TgT,TgT  TgT,TgT,TgT,TgT  
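
Note that `T|T` in M4 is kept as a single value in the output above, because only `/` is split. If `|` should be treated as a separator as well (an assumption about the data, which the question does not state), one possible sketch uses `re.split`. The helper name `split_alleles` is hypothetical.

```python
import itertools as it
import re


def split_alleles(cell):
    """Split a cell such as 'A/T' or 'T|T' on either separator."""
    return re.split(r"[/|]", cell)


# M4 at pos 17 (current line) against M4 at pos 16 (previous line).
current, previous = split_alleles("T|T"), split_alleles("T/G")
print(",".join("g".join(t) for t in it.product(current, previous)))
```

In the dataframe version, this would take the place of `c.split('/')` inside the `applymap` call.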
    

    You can now reset the index to get pos back. You might need to adjust it by shifting it so that it aligns appropriately.
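
As a rough illustration of getting pos back, the sketch below pairs each row with the one before it and records both positions. It is an alternative formulation under stated assumptions, not the answer's exact method: it uses plain `zip` instead of `df + df.shift(1)`, a reduced toy frame with only two columns, and a hypothetical `prev_pos` column name.

```python
import itertools as it

import pandas as pd

# Toy frame with the question's layout (two columns kept for brevity).
df = pd.DataFrame(
    {"pos": [2, 12, 16], "M1": ["A/T", "A/T", "T/G"], "Hx": ["A", "G", "C"]}
).set_index("pos")


def combine(cur, prev):
    """Join every (current, previous) letter pair with 'g'."""
    pairs = it.product(cur.split("/"), prev.split("/"))
    return ",".join("g".join(p) for p in pairs)


# Pair each row with the one before it, column by column.
out = pd.DataFrame(
    {
        col: [combine(c, p) for c, p in zip(df[col].iloc[1:], df[col].iloc[:-1])]
        for col in df.columns
    },
    index=df.index[1:],  # each result row is labelled with the current pos
)

# Bring pos back as a regular column and note which previous pos was used.
out = out.reset_index()
out.insert(1, "prev_pos", list(df.index[:-1]))
print(out)
```

Keeping both `pos` and `prev_pos` makes it explicit which two lines each combination came from, so no separate shifting step is needed afterwards.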