
How can I sum these counts and build a correct DataFrame?


I am using Biopython to parse a FASTA file, and I want to count each base and sum the counts over the whole file: how many A's are there in total, how many T's, and so on?

But I have run into two problems.

1.

handle2="/home/koreanraichu/sra_data_mo.fasta"
for record2 in SeqIO.parse(handle2,"fasta"):
    print(Seq(record2.seq).count("A"))
    print(type(Seq(record2.seq).count("A")))

This code reads each sequence and counts adenine correctly, but it never adds the per-record numbers together. I tried appending to a list and calling sum(), and also plain addition, but neither worked: each count is an int, yet the counts are printed separately instead of being summed.

2.

for record2 in SeqIO.parse(handle2,"fasta"):
    if len(record2.seq) > 100:
        i=0
        i=i+len(record2.seq)
    else:
        j=0
        j=j+len(record2.seq)
print(i,j)

Like the code above, this doesn't work either. It is meant to be a conditional sum that totals the lengths of sequences of 100 bp or more and of sequences shorter than 100 bp separately, but it only prints the values from the last record.

What can I do to solve this?


Solution

  • Try this code for the first problem:

    from Bio import SeqIO
    
    # from Bio.Seq import Seq
    
    handle2="Fasta.fa"
    for record2 in SeqIO.parse(handle2,"fasta"):
        
        # print(record2.seq, type(record2.seq))
        
        # print(str(record2.seq), type(str(record2.seq)))
        
        print(record2.seq.count("A"))
        # print(type(record2.seq).count("A"))  ### --> TypeError: count() missing 1 required positional argument: 'sub'
        
        summarize = 0  # per-record total, reset for every record
        for i in 'ATGC':
            x = record2.seq.count(i)
            print(i, '  :  ', x)
            summarize += x  # reuse the count instead of calling count() twice

        print(summarize)
    

    Given my test FASTA file:

    >Rosalind_4402
    GCAGCTAGCTAGCTAGCTGGGATTCGGATCGGCGCCCCGAGAGGATTCTTTCAGCTGTAA
    GAATTTATCCTCGATCGGGCTATAAAACCTACGCATATCTGCTAGCTGAGGGGCTATCTT
    
    

    Output:

    27
    A   :   27
    T   :   32
    G   :   32
    C   :   29
    120
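
The loop above totals the bases within a single record. To get file-wide totals (how many A's in all records together), the running totals have to live outside the record loop. Here is a minimal sketch of that idea, assuming the sequences are available as plain strings. The function name `total_base_counts` and the toy sequences are made up for illustration and are not part of Biopython.

```python
from collections import Counter


def total_base_counts(sequences, bases="ATGC"):
    """Sum the count of each base over every sequence in `sequences`."""
    # Created once, before the loop, so the totals accumulate across records.
    totals = Counter({base: 0 for base in bases})
    for seq in sequences:
        seq = str(seq).upper()  # accept plain strings or Seq-like objects
        for base in bases:
            totals[base] += seq.count(base)
    return totals


# Toy data standing in for the records of a FASTA file (hypothetical).
example = ["GATTACA", "ccgt", "AAAA"]
totals = total_base_counts(example)
print(dict(totals))          # {'A': 7, 'T': 3, 'G': 2, 'C': 3}
print(sum(totals.values()))  # 15
```

With Biopython, the same function could presumably be fed straight from the parser, e.g. `total_base_counts(record2.seq for record2 in SeqIO.parse(handle2, "fasta"))`, since `str()` of a `Seq` gives the plain sequence. If a table is wanted, `pandas.DataFrame([dict(totals)])` should turn the result into a one-row DataFrame with one column per base.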
    

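    Code for the second problem. The counters i and j are initialized once, before the loop, so they are no longer reset to 0 on every record: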

    from Bio import SeqIO
    
    # from Bio.Seq import Seq
    
    
    # handle2="/home/koreanraichu/sra_data_mo.fasta"
    
    
    handle2="Fasta2.fa"
    
    i=0
    j=0
    for record2 in SeqIO.parse(handle2,"fasta"):
        if len(record2.seq) > 100:
            print('>100 : ', len(record2.seq))
            i=i+len(record2.seq)
        else:
            print('else : ', len(record2.seq))
            j=j+len(record2.seq)
    
    print('> 100 summarize : ', i, ' else summarize : ',j)
    

    Given this test FASTA file:

    >Rosalind_4402
    GCAGCTAGCTAGCTAGCTGGGATTCGGATCGGCGCCCCGAGAGGATTCTTTCAGCTGTAA
    GAATTTATCCTCGATCGGGCTATAAAACCTACGCATATCTGCTAGCTGAGGGGCTATCTT
    >Rosalind_4403
    GCAGCTAGCTAGCTAGCTGGGATTCGGATCGGCGCCCCGAGAGGATTCTTTCAGCTGTAA
    GAATTTATCCTCGATCGGGCTATAAAACCTACGCATATCTGCTAGCTGAGGGGCTATCTT
    GCAGCTAGCTAGCTAGCTGGGATTCGGATCGGCGCCCCGAGAGGATTCTTTCAGCTGTAA
    GAATTTATCCTCGATCGGGCTATAAAACCTACGCATATCTGCTAGCTGAGGGGCTATCTT
    >Rosalind_4404
    GCAGCTAGCTAGCTAGCTGGGATTCGGATCGGCGCCCCGAGAGGATTCTTTCAGCTGTAA
    >Rosalind_4405
    GCAGCTAGCTAGCTAGCTGGGATTCGGATCGGCGCCCCGAGAGGATT
    >Rosalind_4406
    GCAGCTAGCTAGCTAGCTGGGATTCGGATCGGCGCCCCGAGAGGATTCTTTCAGCTGTAA
    GAATTTATCCTCGATCGGGCTATAAAACCTACGCATATCTGCTAGCTGAGGGGCTATCTT
    CTTTCAGCTGTAAGAATTTATCCTCGATCGGGCTATAAAACCTACGCATATCTGCTAGCT
    GAGGGGCTATCTT
    >Rosalind_4407
    GCAGCTAGCTAGCTAGCTGGGATT
    >Rosalind_4408
    GCAGCTAGCTAGCTAGCTGGGATTCGGATCGGCGCCCCGAGAGGATTCTTTCAGCTGTAA
    GAATTTATCCTCGATCGGGCTATAAAACCTACGCATATCTGCTAGCTGAGGGGCTATCTT
    CTTTCAGCTGTAAGAATTTATCCTCGATCGGGCTATAAAACCTACGCATATCTGCTAGC
    
    

    Output:

    >100 :  120
    >100 :  240
    else :  60
    else :  47
    >100 :  193
    else :  24
    >100 :  179
    > 100 summarize :  732  else summarize :  131
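
The accumulation pattern used for the second problem can also be wrapped in a small function. The sketch below is only an illustration: the name `sum_lengths_by_threshold` and its `threshold` parameter are hypothetical, and the input is the list of record lengths printed in the output above rather than a parsed FASTA file.

```python
def sum_lengths_by_threshold(lengths, threshold=100):
    """Return (total of lengths > threshold, total of all other lengths)."""
    # Both totals start at 0 once, before the loop; resetting them inside
    # the loop is what made the original code report only the last record.
    long_total = 0
    short_total = 0
    for length in lengths:
        if length > threshold:
            long_total += length
        else:
            short_total += length
    return long_total, short_total


# Record lengths taken from the test output above.
lengths = [120, 240, 60, 47, 193, 24, 179]
print(sum_lengths_by_threshold(lengths))  # (732, 131)
```

With Biopython, the lengths could be supplied as `(len(record2.seq) for record2 in SeqIO.parse(handle2, "fasta"))`. Note that `length > threshold` puts a sequence of exactly 100 bp in the short group, as in the code above; if "100 bp or more" is really what is wanted, the comparison would need to be `>=`.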