r bioinformatics vcf-variant-call-format

Convert VCF to hap file, collapse genotypes


I am using screenshots to ask my question.

I want my output file to contain only the number/line/sample values.

How can I remove newdata...C... and "" from the output?

Any suggestions?

Note: This is my first time asking a question here. I am trying to follow the rules, and I am still learning.


Solution

  • Try this example:

    library(data.table)
    
    # example vcf
    hap <- fread('
    ##fileformat=VCFv4.0
    ##fileDate=20090805
    ##source=myImputationProgramV3.1
    ##reference=1000GenomesPilot-NCBI36
    ##phasing=partial
    ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
    ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
    ##FILTER=<ID=q10,Description="Quality below 10">
    ##FILTER=<ID=s50,Description="Less than 50% of samples have data">
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA00001 NA00002 NA00003
    19  111 .   A   C   9.6 .   .   GT  0|0 0|0 0|1
    19  112 .   A   G   10  .   .   GT  0|0 1|0 1|1
    19  112 .   A   G   4   .   .   GT  1|0 1|0 1|1
    ')
    
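    # drop the 9 fixed VCF columns, paste each row's genotypes together,
    # then strip the phase separator "|"
    data.table(gsub("|", "", do.call(paste0, hap[, -c(1:9)]), fixed = TRUE))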
    #        V1
    # 1: 000001
    # 2: 001011
    # 3: 101011
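
The question also asks how to get rid of the header and the "" in the output file. The question does not show the code that wrote the file, so this is only a sketch under the assumption that the header and quotes came from the defaults of base R's write.table(). The names collapsed and out_file are made up for illustration, and the data simply repeats the example output above.

```r
# Hypothetical collapsed genotypes, repeating the example output above
collapsed <- data.frame(V1 = c("000001", "001011", "101011"),
                        stringsAsFactors = FALSE)

# Hypothetical output path; a temp file keeps the sketch self-contained
out_file <- tempfile(fileext = ".hap")

# quote = FALSE drops the "" around each string;
# row.names = FALSE and col.names = FALSE drop the row index and header line
write.table(collapsed, file = out_file,
            quote = FALSE, row.names = FALSE, col.names = FALSE)

# Read the file back to check that only the bare strings were written
written <- readLines(out_file)
print(written)
```

If the result is kept as a data.table, data.table's own fwrite() takes similar quote and col.names arguments and should give the same file.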