python bioinformatics dna-sequence vcf-variant-call-format samtools

How to find the base using the chromosome number and position?

I have a list of chromosome numbers and positions which I have obtained from the output of MAC of my sample. The format looks like this:

chr1.12661672.G.A,chr1.12661721.C.T 1/6 11  2   Mutant  
chr1.157640161.C.T,chr1.157640277.G.A   1/6 11  2   Mutant 
chr1.180806049.T.A,chr1.180806061.G.C   1/6 11  11  Mutant
chr1.205929901.G.A,chr1.205930053.C.T   1/7 10  5   Mutant
...

For example for the first row I have G on position 12661672 and C on position 12661721 but I want to find out what bases are between 12661672 and 12661721. Are there any tools for bases for custom positions? I am using the GRCh38 reference genome.

The best I could think of was looking through Ensembl and finding the bases manually. Obviously this is incredibly time consuming so I would something more automated.

I am looking to build a VCF of the MNVs, so something along the lines of:

chr1 180806049 TAGTGAACAAGG AAGTGAACAAGC . PASS

Where all the positions between the identified bases are filled in.

Solution

Extract chromosome, start and end into a bed format file. If this is for one analysis only I would use excel otherwise I would use python. Anyway the result should luke like this

chr1 12661672 12661721  
chr1 157640161 157640277 
chr1 180806049 180806061
chr1 205929901 205930053

Remember that a bed file is a tab-separated-values format

Then, one alternative could be bedtools getfasta

$ cat test.fa
>chr1
AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG

$ cat test.bed
chr1 5 10

$ bedtools getfasta -fi test.fa -bed test.bed
>chr1:5-10
AAACC

# optionally write to an output file
$ bedtools getfasta -fi test.fa -bed test.bed -fo test.fa.out

$ cat test.fa.out
>chr1:5-10
AAACC

Copied from here https://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html