I am new to coding and am currently learning python. Your expertise is greatly appreciated. I have a tab separated data file, tsv.tab, that has 30,000 comma-separated entries in columns two and three but only 1 entry in column one. Each column is separated by a tab. The file looks like so:
57 A,C,G,C ID1,ID2,ID3,ID4
2079 G,C,A,T ID1,ID2,ID3,ID4
5270 C,T,T,G ID1,ID2,ID3,ID4
Please, can you suggest a python method to read in the data, transpose,merge (column 3, i.e. ID1,..) and save the data as a tab or comma-separated file? I will gladly accept pure python, pandas, or numpy suggestions. The output should look like so:
57 2079 5270 ID
A G C ID1
C C T ID2
G A T ID3
C T G ID4
or
57,2079,5270,ID
A,G,C,ID1
C,C,T,ID2
G,A,T,ID3
C,T,G,ID4
So far, I have tried to use pandas as follows.
import pandas as pd
df = pd.read_csv('tsv.tab', delimiter= '\t', header=None)
df.transpose().to_csv('tsv.csv', header = False, index=False)
Suppose you save your sample data to a file called 'test.txt', you can do:
df = pd.read_csv('test.txt',header=None,sep='\s+')
result = (
df[1].str.split(',',expand=True)
.T
.pipe(lambda x: x.rename(columns=dict(zip(x.columns,df[0]))))
.assign(ID=df[2].iloc[0].split(','))
)
result
Out[119]:
57 2079 5270 ID
0 A G C ID1
1 C C T ID2
2 G A T ID3
3 C T G ID4