Search code examples
pythonpandasnumpytransposearray-merge

Transpose and merge a tab-separated file by column using Python Pandas


I am new to coding and am currently learning python. Your expertise is greatly appreciated. I have a tab separated data file, tsv.tab, that has 30,000 comma-separated entries in columns two and three but only 1 entry in column one. Each column is separated by a tab. The file looks like so:

57      A,C,G,C ID1,ID2,ID3,ID4  
2079    G,C,A,T ID1,ID2,ID3,ID4     
5270    C,T,T,G ID1,ID2,ID3,ID4

Please, can you suggest a python method to read in the data, transpose,merge (column 3, i.e. ID1,..) and save the data as a tab or comma-separated file? I will gladly accept pure python, pandas, or numpy suggestions. The output should look like so:

57  2079   5270 ID   
A   G      C    ID1  
C   C      T    ID2  
G   A      T    ID3  
C   T      G    ID4  

or

57,2079,5270,ID 
A,G,C,ID1  
C,C,T,ID2  
G,A,T,ID3  
C,T,G,ID4  

So far, I have tried to use pandas as follows.

import pandas as pd  
df = pd.read_csv('tsv.tab', delimiter= '\t', header=None)
df.transpose().to_csv('tsv.csv', header = False, index=False)  

Solution

  • Suppose you save your sample data to a file called 'test.txt', you can do:

    df = pd.read_csv('test.txt',header=None,sep='\s+')
    
    result = (
        df[1].str.split(',',expand=True)
        .T
        .pipe(lambda x: x.rename(columns=dict(zip(x.columns,df[0]))))
        .assign(ID=df[2].iloc[0].split(','))
        )
    
    result
    Out[119]: 
      57 2079 5270   ID
    0  A    G    C  ID1
    1  C    C    T  ID2
    2  G    A    T  ID3
    3  C    T    G  ID4