Search code examples
c++vcf-vcardvcf-variant-call-formatvcftoolshtslib

How to extract genotype information for each sample as a string from a VCF file using htslib?


I am using htslib for extracting all the information contained in a VCF file in C++.

Currently, thanks to the VCF specification and the documentation in the file vcf.h, I have successfully extracted all the metadata information in the header (Meta-Information Lines), and most of the information contained in each row of the body of the file (Data Lines).

However, I don't know how to extract the genotype information (sample columns).

I am using example files from the 1000G project. This is an example of two rows of the file, it shows the Format field and two samples (The file has more than 1000 samples per each row, I would like to extract the data for all of them):

FORMAT      HG00096                         HG00097
GT:DS:GL    0|0:0.050:-0.48,-0.48,-0.48     0|0:0.050:-0.24,-0.40,-1.49
GT:DS:GL    0|0:0.000:-0.10,-0.69,-4.70     0|0:0.000:-0.05,-0.94,-5.00

I know that this is a heavy task that would take some computation time. I have extracted the names of each column (HG00096, HG00077...), but I don't know how to extract the information of each sample either as a full string (e.g., "0|0:0.050:-0.48,-0.48,-0.48"), as a set (array, map, vector...) of key-value pairs (e.g., [("GT", "0|0"), ("DS", "0.050"), ("GL", "-0.48,-0.48,-0.48")), or simply as an array of values (e.g., ["0|0", "0.050", "-0.48,-0.48,-0.48"]. I would like to do this for each sample.

I have been reading the documentation in the vcf.h file and I think that the function bcf_get_genotypes(hdr,line,dst,ndst) may be suitable for this, but I don't know for sure how to use it for extracting the values as strings. Also, I think that this information may be stored inside the 'p' pointer of 'bcf_fmt_t', but I don't know for sure, it just contains an array set of uint8_t values and I don't know if a string (or char array) can be extracted in the way I want.

typedef struct bcf_fmt_t {
    int id;             
    int n, size, type;  
    uint8_t *p;        
    uint32_t p_len;
    uint32_t p_off:31, p_free:1;
} bcf_fmt_t;

Is there a way of doing that I am trying to do?


Solution

  • I finally figured it out. There are some functions for doing this, depending on the type specified in the header for the format ID: the functions are inside of the vcf.h file in htslib:

    #define bcf_get_format_int32(hdr,line,tag,dst,ndst)  
    bcf_get_format_values(hdr,line,tag,(void**)(dst),ndst,BCF_HT_INT)
    #define bcf_get_format_float(hdr,line,tag,dst,ndst)  
    bcf_get_format_values(hdr,line,tag,(void**)(dst),ndst,BCF_HT_REAL)
    #define bcf_get_format_char(hdr,line,tag,dst,ndst)   
    bcf_get_format_values(hdr,line,tag,(void**)(dst),ndst,BCF_HT_STR)
    #define bcf_get_genotypes(hdr,line,dst,ndst)         
    bcf_get_format_values(hdr,line,"GT",(void**)(dst),ndst,BCF_HT_INT)
    
    HTSLIB_EXPORT
    int bcf_get_format_string(const bcf_hdr_t *hdr, bcf1_t *line, const char *tag, char ***dst, int *ndst);
    
    HTSLIB_EXPORT
    int bcf_get_format_values(const bcf_hdr_t *hdr, bcf1_t *line, const char *tag, void **dst, int *ndst, int type);