python pandas vcf-variant-call-format

Parsing txt files in VCF format


I want to extract information from a txt file into a DataFrame, with the following fields from the data:

1) GENEINFO
2) ID
3) POS
4) ALT
5) CLNSIG
6) CLNDN 

The txt file is here

I wrote the following code to read the file, but I don't know how to proceed. Could you suggest some ideas for how to do this?

import io
import os
import pandas as pd


def read_vcf(path):
    with open(path, 'r') as f:  # use the path argument instead of a hard-coded file name
        lines = [l for l in f if not l.startswith('##')]
    return pd.read_csv(
        io.StringIO(''.join(lines)),
        dtype={'#CHROM': str, 'POS': int, 'ID': str, 'REF': str, 'ALT': str,
               'QUAL': str, 'FILTER': str, 'INFO': str},
        sep='\t'
    ).rename(columns={'#CHROM': 'CHROM'})

Solution

  • You can read it with

    df = pd.read_csv('clinvar_final.txt', comment='#', sep='\t')
    

    After that you will already have a table with the columns 2) ID, 3) POS and 4) ALT:

    print(df[['ID', 'POS', 'ALT']].head())
    

    gives

           ID      POS ALT
    0  475283  1014042   A
    1  542074  1014122   T
    2  183381  1014143   T
    3  542075  1014179   T
    4  475278  1014217   T
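
One caveat about `comment='#'`: in a standard VCF the column header line itself starts with `#` (`#CHROM ...`), so pandas skips it along with the `##` meta lines and would then treat the first data row as the header. The output above suggests that is not a problem for this particular file, presumably because its header line has no leading `#`. For files that do follow the standard layout, a minimal sketch is to pass the column names explicitly. The `VCF_COLUMNS` list and the sample rows below are made up for illustration and are not taken from `clinvar_final.txt`. Note also that `comment='#'` cuts a line at any `#`, including one inside the INFO field.

```python
import io

import pandas as pd

# The eight fixed VCF columns; assumed here, adjust if the file has more.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

# Hypothetical sample in standard VCF layout (values are invented).
SAMPLE = (
    "##fileformat=VCFv4.1\n"
    "##source=example\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t100\t11\tG\tA\t.\t.\tGENEINFO=GENE1:1;CLNSIG=Benign\n"
    "1\t200\t12\tC\tT\t.\t.\tGENEINFO=GENE1:1;CLNSIG=Pathogenic\n"
)

# comment='#' drops every line starting with '#', including the header,
# so the column names are supplied by hand and header=None is set.
df = pd.read_csv(
    io.StringIO(SAMPLE),
    sep="\t",
    comment="#",
    header=None,
    names=VCF_COLUMNS,
    dtype={"CHROM": str},  # keep chromosome names such as 'X' as strings
)

print(df[["ID", "POS", "ALT"]])
```

To use it on a real file, replace `io.StringIO(SAMPLE)` with the file path.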
    

    The other fields ( 1) GENEINFO, 5) CLNSIG, 6) CLNDN ) are stored in the INFO column as a single string, and you can extract them into separate columns using a regex:

    df['GENEINFO'] = df['INFO'].str.extract('GENEINFO=([^;]*)')
    df['CLNSIG'] = df['INFO'].str.extract('CLNSIG=([^;]*)')
    df['CLNDN'] = df['INFO'].str.extract('CLNDN=([^;]*)')
    
    print(df['GENEINFO'].head())
    print(df['CLNSIG'].head())
    print(df['CLNDN'].head())
    

    Result

    0    ISG15:9636
    1    ISG15:9636
    2    ISG15:9636
    3    ISG15:9636
    4    ISG15:9636
    Name: GENEINFO, dtype: object
    
    0                    Benign
    1    Uncertain_significance
    2                Pathogenic
    3    Uncertain_significance
    4                    Benign
    Name: CLNSIG, dtype: object
    
    0    Immunodeficiency_38_with_basal_ganglia_calcifi...
    1    Immunodeficiency_38_with_basal_ganglia_calcifi...
    2    Immunodeficiency_38_with_basal_ganglia_calcifi...
    3    Immunodeficiency_38_with_basal_ganglia_calcifi...
    4    Immunodeficiency_38_with_basal_ganglia_calcifi...
    Name: CLNDN, dtype: object
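
The same extraction can be written as a loop over the wanted keys. The sketch below is a variation, not the code used above: it passes `expand=False` so that `str.extract` returns a Series for the single capture group, and it anchors each key at the start of the string or after a `;` so a key cannot match as the tail of a longer key name. The INFO strings are invented sample data. Rows where a key is absent get `NaN`.

```python
import pandas as pd

# Invented INFO strings; the second row has no CLNDN entry.
df = pd.DataFrame({
    "INFO": [
        "ALLELEID=1;CLNDN=Disease_A;CLNSIG=Benign;GENEINFO=GENE1:1",
        "ALLELEID=2;CLNSIG=Pathogenic;GENEINFO=GENE2:2",
    ]
})

for key in ["GENEINFO", "CLNSIG", "CLNDN"]:
    # (?:^|;) requires the key to start the string or follow a ';'
    pattern = rf"(?:^|;){key}=([^;]*)"
    # expand=False returns a Series because there is one capture group
    df[key] = df["INFO"].str.extract(pattern, expand=False)

print(df[["GENEINFO", "CLNSIG", "CLNDN"]])
```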
    

    Full code:

    import pandas as pd
    
    df = pd.read_csv('clinvar_final.txt', comment='#', sep='\t')
    
    print(df.columns)
    
    print(df[['ID', 'POS', 'ALT']].head())
    
    df['GENEINFO'] = df['INFO'].str.extract('GENEINFO=([^;]*)')
    df['CLNSIG'] = df['INFO'].str.extract('CLNSIG=([^;]*)')
    df['CLNDN'] = df['INFO'].str.extract('CLNDN=([^;]*)')
    
    print(df['GENEINFO'].head())
    print(df['CLNSIG'].head())
    print(df['CLNDN'].head())
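
If more INFO keys are needed later, an alternative to one regex per key is to split each INFO string into a dictionary and build the columns from that. This is a sketch under assumptions: `parse_info` is a hypothetical helper, not a pandas function, and the sample rows are invented. It assumes INFO entries are separated by `;` and written as `KEY=VALUE`, with bare keys treated as flags.

```python
import pandas as pd


def parse_info(info):
    """Split 'K1=V1;K2=V2;FLAG' into a dict (hypothetical helper)."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        # A key without '=' is treated as a boolean flag
        fields[key] = value if sep else True
    return fields


# Invented sample rows; the second one has no CLNDN entry.
df = pd.DataFrame({
    "ID": [11, 12],
    "POS": [100, 200],
    "ALT": ["A", "T"],
    "INFO": [
        "ALLELEID=1;CLNDN=Disease_A;CLNSIG=Benign;GENEINFO=GENE1:1",
        "ALLELEID=2;CLNSIG=Pathogenic;GENEINFO=GENE2:2",
    ],
})

info_df = pd.DataFrame([parse_info(s) for s in df["INFO"]], index=df.index)

# reindex keeps only the wanted keys and fills missing ones with NaN
wanted = ["GENEINFO", "CLNSIG", "CLNDN"]
result = df[["ID", "POS", "ALT"]].join(info_df.reindex(columns=wanted))

# Column order as listed in the question
result = result[["GENEINFO", "ID", "POS", "ALT", "CLNSIG", "CLNDN"]]
print(result)
```

This loops in Python, so on a large file it may be slower than the vectorised `str.extract` calls, but it exposes every INFO key at once.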