Search code examples
pythonbioinformaticsdna-sequencevcf-variant-call-formatsamtools

How to find the base using the chromosome number and position?


I have a list of chromosome numbers and positions which I have obtained from the output of MAC of my sample. The format looks like this:

chr1.12661672.G.A,chr1.12661721.C.T 1/6 11  2   Mutant  
chr1.157640161.C.T,chr1.157640277.G.A   1/6 11  2   Mutant 
chr1.180806049.T.A,chr1.180806061.G.C   1/6 11  11  Mutant
chr1.205929901.G.A,chr1.205930053.C.T   1/7 10  5   Mutant
...

For example for the first row I have G on position 12661672 and C on position 12661721 but I want to find out what bases are between 12661672 and 12661721. Are there any tools for bases for custom positions? I am using the GRCh38 reference genome.

The best I could think of was looking through Ensembl and finding the bases manually. Obviously this is incredibly time consuming so I would something more automated.

I am looking to build a VCF of the MNVs, so something along the lines of:

chr1 180806049 TAGTGAACAAGG AAGTGAACAAGC . PASS

Where all the positions between the identified bases are filled in.


Solution

  • Extract chromosome, start and end into a bed format file. If this is for one analysis only I would use excel otherwise I would use python. Anyway the result should luke like this

    chr1 12661672 12661721  
    chr1 157640161 157640277 
    chr1 180806049 180806061
    chr1 205929901 205930053
    

    Remember that a bed file is a tab-separated-values format

    Then, one alternative could be bedtools getfasta

    $ cat test.fa
    >chr1
    AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG
    
    $ cat test.bed
    chr1 5 10
    
    $ bedtools getfasta -fi test.fa -bed test.bed
    >chr1:5-10
    AAACC
    
    # optionally write to an output file
    $ bedtools getfasta -fi test.fa -bed test.bed -fo test.fa.out
    
    $ cat test.fa.out
    >chr1:5-10
    AAACC
    

    Copied from here https://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html