Parsing txt files in VCF format

I want to extract information from a txt file into a DataFrame with the following fields:

1) GENEINFO
2) ID
3) POS
4) ALT
5) CLNSIG
6) CLNDN

The txt file is here

I wrote the following code to try to get the information from the file, but I don't know how to proceed. Could you give me some ideas on how to do that?

import io
import pandas as pd


def read_vcf(path):
    # Drop the '##' metadata lines but keep the '#CHROM' header line
    with open(path, 'r') as f:
        lines = [l for l in f if not l.startswith('##')]
    # Parse the remaining tab-separated text from memory
    return pd.read_csv(
        io.StringIO(''.join(lines)),
        dtype={'#CHROM': str, 'POS': int, 'ID': str, 'REF': str, 'ALT': str,
               'QUAL': str, 'FILTER': str, 'INFO': str},
        sep='\t'
    ).rename(columns={'#CHROM': 'CHROM'})

Solution

You can read it with

df = pd.read_csv('clinvar_final.txt', comment='#', sep='\t')

and after that you will already have a table with the columns 2) ID, 3) POS and 4) ALT

print(df[['ID', 'POS', 'ALT']].head())

gives

       ID      POS ALT
0  475283  1014042   A
1  542074  1014122   T
2  183381  1014143   T
3  542075  1014179   T
4  475278  1014217   T
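
One thing worth checking: `comment='#'` makes pandas ignore any line that begins with `#`, and cut off the rest of a line wherever a `#` appears. If the header line in your file starts with `#CHROM`, as it does in standard VCF, that line is skipped too and the first data row ends up as the column names. A minimal sketch of a reader that keeps the header, assuming the standard VCF layout; the sample text and the name `read_vcf_text` are made up for illustration and are not taken from clinvar_final.txt:

```python
import io
import pandas as pd

# Tiny made-up sample in standard VCF layout (illustration only).
SAMPLE_VCF = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t100\t11\tG\tA\t.\t.\tGENEINFO=GENE1:1;CLNSIG=Benign\n"
    "1\t200\t12\tC\tT\t.\t.\tGENEINFO=GENE1:1;CLNSIG=Pathogenic\n"
)


def read_vcf_text(handle):
    """Read VCF-style text: skip '##' metadata, keep the '#CHROM' header."""
    lines = [line for line in handle if not line.startswith("##")]
    # dtype=str keeps every column as text, so IDs are not reinterpreted
    df = pd.read_csv(io.StringIO("".join(lines)), sep="\t", dtype=str)
    return df.rename(columns={"#CHROM": "CHROM"})


vcf_df = read_vcf_text(io.StringIO(SAMPLE_VCF))
print(vcf_df.columns.tolist())
```

For a real file you would pass an open file object instead of the `io.StringIO` sample.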

The other fields ( 1) GENEINFO, 5) CLNSIG, 6) CLNDN ) are stored in the INFO column as a single string, and you can extract them into separate columns using a regex

df['GENEINFO'] = df['INFO'].str.extract('GENEINFO=([^;]*)')
df['CLNSIG'] = df['INFO'].str.extract('CLNSIG=([^;]*)')
df['CLNDN'] = df['INFO'].str.extract('CLNDN=([^;]*)')

print(df['GENEINFO'].head())
print(df['CLNSIG'].head())
print(df['CLNDN'].head())

Result

0    ISG15:9636
1    ISG15:9636
2    ISG15:9636
3    ISG15:9636
4    ISG15:9636
Name: GENEINFO, dtype: object

0                    Benign
1    Uncertain_significance
2                Pathogenic
3    Uncertain_significance
4                    Benign
Name: CLNSIG, dtype: object

0    Immunodeficiency_38_with_basal_ganglia_calcifi...
1    Immunodeficiency_38_with_basal_ganglia_calcifi...
2    Immunodeficiency_38_with_basal_ganglia_calcifi...
3    Immunodeficiency_38_with_basal_ganglia_calcifi...
4    Immunodeficiency_38_with_basal_ganglia_calcifi...
Name: CLNDN, dtype: object
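
If you later need more keys than these three, it may be easier to split the whole INFO string once instead of writing one regex per key. A rough sketch, assuming INFO is a semicolon-separated list of `KEY=VALUE` items (the usual VCF layout); `parse_info` is a hypothetical helper and the INFO strings are made up:

```python
import pandas as pd


def parse_info(info):
    """Split 'KEY=VALUE;KEY=VALUE' into a dict; flags without '=' map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue  # tolerate empty items such as a trailing ';'
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


# Made-up INFO strings for illustration only.
info_df = pd.DataFrame({"INFO": [
    "ALLELEID=1;CLNDN=Some_condition;CLNSIG=Benign;GENEINFO=GENE1:1",
    "ALLELEID=2;CLNSIG=Pathogenic;GENEINFO=GENE2:2",
]})

# One dict per row becomes one column per key; missing keys become NaN.
parsed = pd.DataFrame(info_df["INFO"].map(parse_info).tolist())
wanted = parsed.reindex(columns=["GENEINFO", "CLNSIG", "CLNDN"])
print(wanted)
```

This is slower than `str.extract` on a large file because it runs a Python function per row, so for only a few keys the regex version is probably the better choice.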

Full code

import pandas as pd

df = pd.read_csv('clinvar_final.txt', comment='#', sep='\t')

print(df.columns)

print(df[['ID', 'POS', 'ALT']].head())

df['GENEINFO'] = df['INFO'].str.extract('GENEINFO=([^;]*)')
df['CLNSIG'] = df['INFO'].str.extract('CLNSIG=([^;]*)')
df['CLNDN'] = df['INFO'].str.extract('CLNDN=([^;]*)')

print(df['GENEINFO'].head())
print(df['CLNSIG'].head())
print(df['CLNDN'].head())
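
To finish with a single DataFrame holding only the six requested fields, you can select the columns at the end. A small sketch on made-up in-memory data (not the real file); the `(?:^|;)` prefix in the pattern is an optional extra that keeps a key from matching inside a longer key that happens to end with the same name:

```python
import io
import pandas as pd

# Made-up tab-separated sample standing in for clinvar_final.txt.
SAMPLE = (
    "CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t100\t11\tG\tA\t.\t.\tCLNDN=Some_condition;CLNSIG=Benign;GENEINFO=GENE1:1\n"
    "1\t200\t12\tC\tT\t.\t.\tCLNSIG=Pathogenic;GENEINFO=GENE2:2\n"
)

df = pd.read_csv(io.StringIO(SAMPLE), sep="\t")

# Pull each key out of INFO; expand=False returns a Series per key.
for key in ["GENEINFO", "CLNSIG", "CLNDN"]:
    df[key] = df["INFO"].str.extract(rf"(?:^|;){key}=([^;]*)", expand=False)

# Keep only the requested fields, in the order listed in the question.
result = df[["GENEINFO", "ID", "POS", "ALT", "CLNSIG", "CLNDN"]]
print(result)
```

Rows whose INFO string lacks a key get NaN in that column, as with CLNDN in the second sample row.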