Search code examples
bioinformaticsvcf-variant-call-format

Ignoring Sample Information when Making VCF with Freebayes


I have single-cell RNA-Seq experiment with >1000 samples from the same individual. To get an idea of their genotype, I want to do a pseudo-bulk variant call where I merge the 1000+ BAM files into one and call variants from that. The problem I have is that Freebayes will give each sample from the merged VCF file a separate column rather than one "merged" column. I couldn't find any options on Freebayes that ignore read groups.

Normal output:

<header info>
#CHROM   POS   ID   REF   ALT   QUAL   FILTER   INFO   FORMAT   SAMPLE 1   SAMPLE 2 ...
<data>

Desired output:

<header info>
#CHROM   POS   ID   REF   ALT   QUAL   FILTER   INFO   FORMAT   COMBINED_SAMPLE
<data>

Solution

  • While not elegant, a solution I have found is to modify the header of the BAM file so that every read group maps to the same sample. A quick way of doing this with sed and SAMtools:

    samtools reheader -c "sed 's/\bSM:[^ ]*/SM:combined/'" input.bam > output.bam
    

    In this example, samtools reheader modifies the BAM file's header using the command passed with -c. The sed command matches any space-separated word that starts with "SM:" and replaces it with SM:combined:

    Before
    @RG   ID:read_group_1   LB:RNA   PL:Illumina   SM:sample_A
    @RG   ID:read_group_2   LB:RNA   PL:Illumina   SM:sample_B
    
    After
    @RG   ID:read_group_1   LB:RNA   PL:Illumina   SM:combined
    @RG   ID:read_group_2   LB:RNA   PL:Illumina   SM:combined
    

    When running Freebayes on the new BAM file, all metadata is contained within a single sample column named "combined" (copied here):

    <header info>
    #CHROM   POS   ID   REF   ALT   QUAL   FILTER   INFO   FORMAT   combined
    <data>