Search code examples
pythonbioinformaticsextractfasta

How to randomly extract FASTA sequences using Python?


I have the following sequences which is in a fasta format with sequence header and its nucleotides. How can I randomly extract the sequences. For example I would like to randomly select 2 sequences out of the total sequences. There are tools provided to do so is to extract according to percentage but not the number of sequences. Can anyone help me?

A.fasta

>chr1:1310706-1310726
GACGGTTTCCGGTTAGTGGAA
>chr1:901959-901979
GAGGGCTTTCTGGAGAAGGAG
>chr1:983001-983021
GTCCGCTTGCGGGACCTGGGG
>chr1:984333-984353
CTGGAATTCCGGGCGCTGGAG
>chr1:1154147-1154167
GAGATCGTCCGGGACCTGGGT

Expected Output

>chr1:1154147-1154167
GAGATCGTCCGGGACCTGGGT
>chr1:901959-901979
GAGGGCTTTCTGGAGAAGGAG

Solution

  • If you are working with fasta files use BioPython, to get n sequences use random.sample:

    from Bio import SeqIO
    from random import sample
    with open("foo.fasta") as f:
        seqs = SeqIO.parse(f,"fasta")
        print(sample(list(seqs), 2))
    

    Output:

    [SeqRecord(seq=Seq('GAGATCGTCCGGGACCTGGGT', SingleLetterAlphabet()), id='chr1:1154147-1154167', name='chr1:1154147-1154167', description='chr1:1154147-1154167', dbxrefs=[]), SeqRecord(seq=Seq('GTCCGCTTGCGGGACCTGGGG', SingleLetterAlphabet()), id='chr1:983001-983021', name='chr1:983001-983021', description='chr1:983001-983021', dbxrefs=[])]
    

    You can extract the strings if necessary:

     print([(seq.name,str(seq.seq)) for seq in  sample(list(seqs),2)])
     [('chr1:1310706-1310726', 'GACGGTTTCCGGTTAGTGGAA'), ('chr1:983001-983021', 'GTCCGCTTGCGGGACCTGGGG')]
    

    If the lines were always in pairs and you skipped the metadata at the top you could zip:

    from random import sample
    
    with open("foo.fasta") as f:
        print(sample(list(zip(f, f)), 2))
    

    Which will give you pairs of lines in tuples:

    [('>chr1:983001-983021\n', 'GTCCGCTTGCGGGACCTGGGG\n'), ('>chr1:984333-984353\n', 'CTGGAATTCCGGGCGCTGGAG\n')]
    

    To get the lines ready to be written:

    from Bio import SeqIO
    from random import sample
    with open("foo.fasta") as f:
        seqs = SeqIO.parse(f, "fasta")
        samps = ((seq.name, seq.seq) for seq in  sample(list(seqs),2))
        for samp in samps:
            print(">{}\n{}".format(*samp))
    

    Output:

    >chr1:1310706-1310726
    GACGGTTTCCGGTTAGTGGAA
    >chr1:983001-983021
    GTCCGCTTGCGGGACCTGGGG