Search code examples
pythonbioinformaticsgenomevcf-variant-call-format

How to get genotype data of SNVs from several new bam files?


I have several SNVs:

#CHROM  POS ID REF ALT
chr1   1000  .   A   C
chr1   2000  .   T   A

Can I get their genotype data (REF coverage, ALT coverage) from some new bam files? for example, bam files from 2 tumor samples:

tumor1.bam
tumor2.bam

I expect a VCF format result like below, other formats work too.

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DR,Number=1,Type=Integer,Description="Number of reference reads">
##FORMAT=<ID=DV,Number=1,Type=Integer,Description="Number of variant reads">
#CHROM  POS ID REF ALT  FORMAT   tumor1.bam tumor2.bam
chr1   1000  .   A   C  GT:DR:DV  0/1:20:10   0/1:30:5
chr1   2000  .   T   A  GT:DR:DV  0/1:80:10   0/1:20:8

Any suggestions?


Solution

  • Thanks to @Pierre, one solution to this question was found.

    bcftools mpileup can output genotypes in a vcf format:

    bcftools mpileup -f ref.fa -R input.vcf tumor1.bam tumor2.bam -a FORMAT/AD
    

    The output will be like this:

    ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
    ##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths (high-quality bases)">
    #CHROM  POS ID REF     ALT  FORMAT    tumor1.bam   tumor2.bam
    chr1   1000  .   A   C,<*>   PL:AD    xx:20,10,0    xx:30,5,0
    chr1   2000  .   T   A,<*>   PL:AD    xx:80,10,0    xx:20,8,0