I want to extract information from a txt file into a DataFrame with the following fields:
1) GENEINFO
2) ID
3) POS
4) ALT
5) CLNSIG
6) CLNDN
I wrote the following code to read the file, but I don't know how to proceed. Could you suggest some ideas for how to do that?
import io
import os
import pandas as pd
def read_vcf(path):
    # Skip the '##' metadata lines but keep the '#CHROM' header line.
    with open(path, 'r') as f:
        lines = [l for l in f if not l.startswith('##')]
    return pd.read_csv(
        io.StringIO(''.join(lines)),
        dtype={'#CHROM': str, 'POS': int, 'ID': str, 'REF': str, 'ALT': str,
               'QUAL': str, 'FILTER': str, 'INFO': str},
        sep='\t'
    ).rename(columns={'#CHROM': 'CHROM'})
You can read it with
df = pd.read_csv('clinvar_final.txt', comment='#', sep='\t')
After that you will already have a table with the columns 2) ID, 3) POS and 4) ALT:
print(df[['ID', 'POS', 'ALT']].head())
gives
       ID      POS ALT
0  475283  1014042   A
1  542074  1014122   T
2  183381  1014143   T
3  542075  1014179   T
4  475278  1014217   T
The other information (1) GENEINFO, 5) CLNSIG and 6) CLNDN) is stored in the INFO column as a single string, and you can extract it into separate columns using regex:
df['GENEINFO'] = df['INFO'].str.extract('GENEINFO=([^;]*)')
df['CLNSIG'] = df['INFO'].str.extract('CLNSIG=([^;]*)')
df['CLNDN'] = df['INFO'].str.extract('CLNDN=([^;]*)')
print(df['GENEINFO'].head())
print(df['CLNSIG'].head())
print(df['CLNDN'].head())
Result
0 ISG15:9636
1 ISG15:9636
2 ISG15:9636
3 ISG15:9636
4 ISG15:9636
Name: GENEINFO, dtype: object
0 Benign
1 Uncertain_significance
2 Pathogenic
3 Uncertain_significance
4 Benign
Name: CLNSIG, dtype: object
0 Immunodeficiency_38_with_basal_ganglia_calcifi...
1 Immunodeficiency_38_with_basal_ganglia_calcifi...
2 Immunodeficiency_38_with_basal_ganglia_calcifi...
3 Immunodeficiency_38_with_basal_ganglia_calcifi...
4 Immunodeficiency_38_with_basal_ganglia_calcifi...
Name: CLNDN, dtype: object
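If you prefer not to write one regex per field, the INFO string can also be split on `;` and `=` directly. This is only a minimal sketch: `parse_info` is a hypothetical helper name, and the sample string is made up for illustration, not taken from the file.

```python
def parse_info(info, keys=("GENEINFO", "CLNSIG", "CLNDN")):
    """Split a VCF-style INFO string into a dict holding only the requested keys."""
    fields = {}
    for item in info.split(";"):
        # Flag entries have no '=', so `sep` is empty and they are skipped.
        key, sep, value = item.partition("=")
        if sep and key in keys:
            fields[key] = value
    # Missing keys become None so every row ends up with the same columns.
    return {key: fields.get(key) for key in keys}


# Hypothetical INFO string, invented for this example.
sample = "ALLELEID=1;CLNDN=Some_condition;CLNSIG=Benign;GENEINFO=ISG15:9636"
parsed = parse_info(sample)
print(parsed)
```

With pandas, something like `pd.DataFrame(df['INFO'].map(parse_info).tolist(), index=df.index)` should then give the three columns in one pass, though I have not run that against the actual file.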
Full code:
import pandas as pd
df = pd.read_csv('clinvar_final.txt', comment='#', sep='\t')
print(df.columns)
print(df[['ID', 'POS', 'ALT']].head())
df['GENEINFO'] = df['INFO'].str.extract('GENEINFO=([^;]*)')
df['CLNSIG'] = df['INFO'].str.extract('CLNSIG=([^;]*)')
df['CLNDN'] = df['INFO'].str.extract('CLNDN=([^;]*)')
print(df['GENEINFO'].head())
print(df['CLNSIG'].head())
print(df['CLNDN'].head())
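To end up with a DataFrame that holds only the six fields from the question, you can select those columns after the extraction. The sketch below uses a made-up two-row sample in place of `clinvar_final.txt`; the header layout and all values in it are assumptions for illustration only.

```python
import io

import pandas as pd

# Made-up sample standing in for clinvar_final.txt (layout and values are assumptions).
sample = (
    "CHROM\tPOS\tID\tREF\tALT\tFILTER\tQUAL\tINFO\n"
    "1\t100\t11\tG\tA\t.\t.\tCLNDN=Example_condition;CLNSIG=Benign;GENEINFO=GENE1:1\n"
    "1\t200\t22\tC\tT\t.\t.\tCLNSIG=Pathogenic;GENEINFO=GENE1:1\n"
)
df = pd.read_csv(io.StringIO(sample), sep="\t")

# expand=False returns a Series, so each assignment creates exactly one column.
for key in ("GENEINFO", "CLNSIG", "CLNDN"):
    df[key] = df["INFO"].str.extract(rf"{key}=([^;]*)", expand=False)

# Keep only the six requested fields, in the order listed in the question.
result = df[["GENEINFO", "ID", "POS", "ALT", "CLNSIG", "CLNDN"]]
print(result)
```

Rows whose INFO string lacks a key get NaN in that column, so it may be worth checking `result.isna().sum()` before relying on the values.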