Search code examples
pythonbioinformaticsbiopythonfastabed

Python: Extract DNA sequence from FASTA file using Bed file


May I know how can I extract dna sequence from fasta file? I tried bedtools and samtools. Bedtools getfasta did well but for some of my file return "warning: chromosome was not found in fasta file" but the fact is the chromosome name in bed file and fasta are exactly the same. I'm looking for other alternative that python can do this task for me.

Bed file:
chr1:117223140-117223856 3 7
chr1:117223140-117223856 5 9

Fasta file:
>chr1:117223140-117223856
CGCGTGGGCTAGGGGCTAGCCCC

Desired output:
>chr1:117223140-117223856
CGTGG
>chr1:117223140-117223856
TGGGC


Solution

  • BioPython is what you want to use:

    from Bio import SeqIO
    from Bio.Seq import Seq
    from Bio.SeqRecord import SeqRecord
    from collections import defaultdict
    
    # read names and postions from bed file
    positions = defaultdict(list)
    with open('positions.bed') as f:
        for line in f:
            name, start, stop = line.split()
            positions[name].append((int(start), int(stop)))
    
    # parse faste file and turn into dictionary
    records = SeqIO.to_dict(SeqIO.parse(open('sequences.fasta'), 'fasta'))
    
    # search for short sequences
    short_seq_records = []
    for name in positions:
        for (start, stop) in positions[name]:
            long_seq_record = records[name]
            long_seq = long_seq_record.seq
            alphabet = long_seq.alphabet
            short_seq = str(long_seq)[start-1:stop]
            short_seq_record = SeqRecord(Seq(short_seq, alphabet), id=name, description='')
            short_seq_records.append(short_seq_record)
    
    # write to file
    with open('output.fasta', 'w') as f:
        SeqIO.write(short_seq_records, f, 'fasta')